Some Bibliographical References on
Neuro-Dynamic Programming
 (Reinforcement Learning)
Applications

 

 

·        Theoretical

·        Finances

·        Inventory management

·        Semiconductor Manufacturing

·        Telecommunications

·        Games and Strategy

·        Artificial Intelligence and NDP applications

·        Bayesian Networks

·        Internet

·        Optimal Stopping

·        Autonomous vehicles

·        Motion Planning

·        Military

·        Industrial applications

·        Other: Combinatorial Optimization, Maintenance and Repair, Dynamic Channel Allocation, Job-Shop Scheduling, etc.

 

 

 

 

 

Theoretical applications

 

Temporal Differences–Based Policy Iteration and Applications in Neuro–Dynamic Programming.

D. P. Bertsekas and S. Ioffe,

Technical Report LIDS–P–2349, MIT Laboratory for Information and Decision Systems, 1996.

 

Abstract:  a new policy iteration method for dynamic programming problems with discounted and undiscounted cost is considered, based on the notion of temporal differences, and is primarily geared to the case of large and complex problems where the use of approximations is essential.  It is describe how to embed them within neuro-dynamic programming/reinforcement learning.  An example in training a tetris playing program is given.

 

 

 

Reinforcement Learning Algorithms for Average Payoff Markovian Decision Processes.

S. P. Singh.

In Proceedings of the Twelfth National

Conference on Artificial Intelligence, pages 202–207, 1994.

 

Abstract:.  A new average-payoff RL algorithms are presented as stochastic approximation methods for solving the system of equations associated with the policy evaluation and optimal control questions in average-payoff RL tasks.  These algorithms already developed for the discounted-payoff case.  Preliminary empirical results are presented to validate these new algorithms.

 

 

Differential Training of Rollout Policies
D. P Bertsekas
Proc. of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., October 1997

Abstract:  An approximate solution of stochastic optimal control problems using a neuro-dynamic programming/reinforcement learning methodology is presented. The paper is focused on the computation of a rollout policy, which is obtained by a single policy iteration starting from some known base policy and using some form of exact or approximate policy improvement. In particular, a method, called differential training, that can be used to obtain an approximation to cost-to-go differences rather than cost-to-go values is presented by using standard methods such as TD(lambda) and lambda-policy iteration. This method is suitable for recursively generating rollout policies in the context of simulation-based policy iteration methods.

 

 

 

 

Finances:

Regression Methods for Pricing Complex American--Style Options
B. Van Roy and J. N. Tsitsiklis,
IEEE Trans. on Neural Networks, Vol. 12, No. 4, July 2001, pp. 694-703.

Abstract:  a simulation-based approximate dynamic programming method for pricing complex American-style options, with possibly high-dimensional underlying state space is presented. 

 

Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing Financial Derivatives
J. N. Tsitsiklis and B. Van Roy,
IEEE Transactions on Automatic Control; Vol. 44, No. 10, October 1999, pp. 1840-1851.

Abstract: a develop theory characterizing optimal stopping times for discrete-time ergodic Markov processes with discount rewards is presented.  It is proposed a stochastic approximation algorithm that tunes weights of a linear combination of basis functions in order to approximate a value function.  The utility of the approximation method is illustrated via a computational case study involving the pricing of a path-dependant financial derivative security that gives rise to an optimal stopping problem with a one-hundred-dimensional state space.

 

 

Inventory management:

A Neuro-Dynamic Programming Approach to Retailer Inventory Management,
B. Van Roy, D. P. Bertsekas, Y. Lee, and J. N. Tsitsiklis,
for Information and Decision Systems Report, Nov. 1996.
 

Abstract: an application of neuro-dynamic programming techniques to the optimization of retailer inventory systems.  A specific case study involving a model with thirty-three state variables is presented. The performance of solutions generated by neuro-dynamic programming algorithms is compare with the delivered by optimized s-type ("order-up-to") policies. The NDP algorithm is able to generate control strategies substantially superior, reducing inventory costs by approximately ten percent.

 

 

Semiconductor Manufacturing

Optimal etch time control design using neuro-dynamic programming
Lei Yang; Si, J.
Semiconductor Manufacturing Symposium, 2001 IEEE International, 2001
Page(s): 75 -78.

Abstract: An application to design the optimal etch time control system for a reactive ion etch process is presented. A predictive neural network model is built representing the relation between some state variables and the resulting thickness remain. The NDP is employed to determine the optimal etch time based on the predictive film thickness remain model. Simulation results show that NDP is a viable learning optimization tool.

 

Telecommunications

Call admission control and routing in integrated services networks using neuro-dynamic programming
Marbach, P.; Mihatsch, O.; Tsitsiklis, J.N.
Selected Areas in Communications, IEEE Journal on , Volume: 18 Issue: 2 , Feb. 2000 Page(s): 197 -208.

Abstract: The problem of maximizing the average value of admitted calls per unit time (or of revenue maximization) is naturally formulated as a dynamic programming problem, but is too complex to allow for an exact solution.  Then NDP is used together with a decomposition approach, to construct dynamic (state-dependent) call admission control and routing policies. These policies are based on state-dependent link costs, and a simulation-based learning method is employed to tune the parameters that define these link costs.

 

Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems.

S. P. Singh and D. P. Bertsekas,
In Advances in Neural Information Processing Systems 10, Cambridge, MA, 1997. MIT Press.

 

Demo online - animated  (applet): http://envy.cs.umass.edu/People/singh/Demo.html

 

Abstract: Application to dynamically allocate the communication resource (channels) so as to maximize service in a stochastic caller environment.  This problem is naturally formulated as a dynamic programming problem. Reinforcement learning (RL) method is used to find dynamic channel allocation policies that are better than previous heuristic solutions.  The policies obtained perform well for a broad variety of call traffic patterns.  Results on a large cellular system with approximately 7049 states are presented.

 

 

A neuro-dynamic programming approach to admission control in ATM networks: the single link case
Marbach, P.; Tsitsiklis, J.N.
Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on , Volume: 1 , 1997 Page(s): 159 -162 vol.1

Abstract:  .It is show how neuro-dynamic programming can be applied to the admission control problem for a single link in an ATM environment. Based on results obtained through neuro-dynamic programming,  a heuristic "threshold" policy is derived. Performances of the policies obtained through neuro-dynamic programming are compared with a policy which always accepts a customer when the required resources are available.

 

 

Optimizing admission control while ensuring quality of service in multimedia networks via reinforcement learning (1999).
Timothy X Brown, Hui Tong, Satinder Singh,
NIPS , 1998.


Abstract: This paper examines the application of reinforcement learning to a telecommunications networking problem. The problem requires that revenue be maximized while simultaneously meeting a quality of service constraint that forbids entry into certain states.

 

 

 

Games and Strategy

Play Selection In American Football: A Case Study In Neuro-Dynamic Programming.
Stephen D. Patek, Dimitri P. Bertsekas.

Abstract:  the problem of play selection in American football as a stochastic shortest path Markov Decision Problem (MDP) is presented. In particular, it is considered the problem faced by a quarterback in attempting to maximize the net score of an offensive drive.

 

Strategy acquisition for the game "Othello" based on reinforcement learning.

Taku Yoshioka, Shin Ishii, Minoru Ito.

 

Abstract:  This article discusses automatic strategy acquisition for the game "Othello" based on reinforcement learning. A case is presented when two computer players initially know only the game rules, but they become relatively stronger after playing several thousands of games against each other. In each game, the players refine the evaluation function for the game state, which is achieved in a reinforcement learning manner. Since the state space is very large, we employ an RBF (Radial Basis Functions) neural network to approximate the evaluation function. As a result, the players become strong enough to beat a player employing a heuristic strategy.

Play Othello at: http://www.rainfall.com/othello/default.asp

 

 

Artificial Intelligence and NDP applications

Genetic algorithms and neuro-dynamic programming: application to water supply networks
Damas, M.; Salmeron, M.; Diaz, A.; Ortega, J.; Prieto, A.; Olivares, G.
Evolutionary Computation, 2000. Proceedings of the 2000 Congress on , Volume: 1 , 2000  Page(s): 7 -14 vol.1

Abstract: A hybrid genetic algorithm is used to determine the feasible functioning states in each stage. A procedure for series prediction based on Radial Basis Function (RBF) neural networks allows the uncertainty about state transitions to be avoided and Monte Carlo simulations are used to approximate the cost-to-go function. As an example, the proposed procedure is applied to a water supply network scheduling problem.

 

Bayesian Networks

Neuro-dynamic programming for adaptive control of Bayesian networks for global awareness
Ross, K.N.; Chaney, R.D.;
Patek, S.D.
Information Technology Conference, 1998. IEEE, 1998 Page(s): 10 -13

Abstract: In order to be efficient and robust, future systems for global awareness applications must accommodate new data sources and alter their computational structure automatically. The utility of neuro-dynamic programming (NDP) for addressing such challenges in a real-time application is presented. A description of the algorithm implementation is presented for a simple information fusion system that uses NDP to allocate resources to Bayesian networks. The Bayesian networks use information from distributed sensors and databases to infer identity of vehicle clusters.

 

 

Internet

Using Reinforcement Learning to Spider the Web Efficiently,
Jason Rennie, Andrew Kachites McCallum.
Proc. 16th International Conf. on Machine Learning, 1999.

 

Abstract:  Consider the task of exploring the Web in order to find pages of a particular kind or on a particular topic. This task arises in the construction of search engines and Web knowledge bases. This paper argues that the creation of efficient web spiders is best framed and solved by reinforcement learning.  It is presented an algorithm for learning a value function that maps hyperlinks to future discounted reward using a naive Bayes text classifier.

 

 

Autonomous vehicles

An hybrid methodology for RL-based behavior coordination in a target following mission with an AUV
Carreras, M.; Yuh, J.; Batlle, J.
OCEANS, 2001. MTS/IEEE Conference and Exhibition , Volume: 4 , 2001
Page(s): 2666 -2672 vol.4

Abstract: Proposes a behavior-based scheme for high-level control of autonomous underwater vehicles (AUVs). Behavior state/action mapping is learnt by means of reinforcement learning (RL). A continuous Q-learning algorithm, implemented with a feed-forward neural network, is used. The behavior-based scheme attempts to fulfill simple missions in which several behaviors/tasks compete for the vehicle's control.

 

Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods
J. Andrew Bagnell, Jeff C. Schneider,  2001.

Abstract: The problem of autonomous Helicopter Control is cast as a Partially Observed Markovian Decision Problem (POMDPs), an optimal control formalism. A optimal solution is found to such problem using RL.

 

 

 

Motion Planning

Robust Reinforcement Learning in Motion Planning.
With A.G. Barto, R.A. Grupen, and C.I. Connolly
In NIPS 6, 1994.

 

Abstract:  This paper presents a method that uses domain knowledge to reduce the number of failures during exploration for motion planning.  This method formulates the set of actions from which the RL agent composes a control policy to ensure that exploration is conducted in a police space that excludes most of the unacceptable policies. 


Barto Papers in: http://www.cs.colorado.edu/~baveja/papers.html#applications

 

 

Military – defense and interceptor allocation

Missile defense and interceptor allocation by neuro-dynamic programming
Bertsekas, D.P.; Homer, M.L.; Logan, D.A.; Patek, S.D.; Sandell, N.R.
Systems, Man and Cybernetics, Part A, IEEE Transactions on , Volume: 30 Issue: 1 , Jan. 2000
Page(s): 42 -51

Abstract: Application to missile defense problem involving the sequential allocation of defensive resources over a series of engagements.  Given the problem is computationally intractable by exact methods because of its large number of states and its complex modeling issues NDP is used.  The cost-to-go function is approximated using neural network; different training methods are used and compared in its performance.

 

 

Industrial applications

Improving Elevator Performance Using Reinforcement Learning.
R. H. Crites and A. G. Barto,
In Advances in Neural Information Processing

Systems 8, Cambridge, MA, 1995. MIT Press.

Abstract: This paper describes the application of reinforcement learning (RL) to the problem of elevator dispatching. A team of RL agents is used, each of which is responsible for controlling one elevator car. The team receives a global reinforcement signal which appears noisy to each agent due to the effects of the actions of the other agents, the random nature of the arrivals and the incomplete observation of the state. In spite of these complications, it is show results that in simulation surpass the best of the heuristic elevator control algorithms used. These results demonstrate the power of RL on a very large scale stochastic dynamic optimization problems.

 

 

 

Other references:  Books

Cover Image                       

 

Neuro-Dynamic Programming

By Dimitri P. Bertsekas
and John Tsitsiklis
,
Athena Scientific, 1996.

Includes:

·        Case Studies

    • Parking
    • Football
    • Tetris
    • Combinatorial Optimization - Maintenance and Repair
    • Dynamic Channel Allocation
    • Backgammon
    • Notes and Sources

 

 

 

 

Reinforcement Learning:  An Introduction

By Richard S. Sutton
and Andrew G. Barto

MIT Press, Cambridge, MA, 1998
A
Bradford Book

Includes:


Complete electronic version of the book in:
http://www-anw.cs.umass.edu/~rich/book/the-book.html