On-line Policy Improvement using Monte-Carlo Search
Gerald Tesauro, Gregory R. Galperin
TL;DR
The paper addresses real-time policy improvement for adaptive controllers by introducing an on-line Monte-Carlo search that estimates $V_P(x,a)$ through simulated trajectories under a base policy $P$ and derives the improved policy as $P'(x)=\arg\max_{a} V_P(x,a)$. It highlights parallelizable computation and pruning strategies to manage CPU costs, making deep lookahead feasible in practice. Empirical results in backgammon show substantial reductions in base-player equity loss across weak to strong base policies, with single-layer networks approaching TD-Gammon strengths and multi-layer networks benefiting from truncated rollouts to achieve real-time performance. The study demonstrates the practicality of on-line Monte-Carlo policy improvement and outlines potential extensions to other reinforcement-learning control problems, including doubling decisions and alternative training schemes. These findings underscore the method's potential to surpass traditional offline policy iteration in domains where accurate simulators enable real-time decision making.
Abstract
We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
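The procedure described in the abstract can be illustrated with a minimal serial sketch in Python. It is not the authors' implementation, which runs in parallel on the SP1 and SP2 and targets backgammon. The function names (`rollout`, `improved_action`), the `step` simulator interface, and the toy "distance to goal" environment are all assumptions introduced here for illustration only.

```python
import random
from statistics import mean


def rollout(state, action, base_policy, step, rng, max_steps=1000):
    """Simulate one trajectory: take `action` first, then follow the base policy.

    `step(state, action, rng)` is a hypothetical simulator interface that
    returns (next_state, reward, done).
    """
    state, reward, done = step(state, action, rng)
    total = reward
    steps = 1
    while not done and steps < max_steps:
        state, reward, done = step(state, base_policy(state), rng)
        total += reward
        steps += 1
    return total


def improved_action(state, actions, base_policy, step, n_rollouts=200, seed=0):
    """Estimate the expected return of each candidate action by Monte-Carlo
    rollouts under the base policy, and return the action with the best
    estimate together with all estimates."""
    rng = random.Random(seed)
    estimates = {
        a: mean(rollout(state, a, base_policy, step, rng) for _ in range(n_rollouts))
        for a in actions
    }
    best = max(estimates, key=estimates.get)
    return best, estimates


# --- Toy environment (an assumption, not from the paper) -------------------
# The state is an integer distance to a goal; every step costs 1.
# "safe" always reduces the distance by 1.
# "risky" reduces it by 3 with probability 0.5 and otherwise does nothing.
def toy_step(distance, action, rng):
    if action == "safe":
        distance -= 1
    elif rng.random() < 0.5:
        distance -= 3
    return distance, -1.0, distance <= 0


def always_safe(distance):
    """Base policy: never gamble."""
    return "safe"


if __name__ == "__main__":
    best, estimates = improved_action(6, ["safe", "risky"], always_safe, toy_step)
    print(best, estimates)
```

In this toy problem the base policy reaches the goal from distance 6 in exactly 6 steps, while gambling once and then following the base policy costs 4 or 7 steps with equal probability (5.5 on average), so the rollout estimates should favor "risky" as the first move. Because each candidate action and each trial is simulated independently, the loop over rollouts is the part that could be distributed across processors.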
