Continuous Monte Carlo Graph Search

Kalle Kujanpää, Amin Babadi, Yi Zhao, Juho Kannala, Alexander Ilin, Joni Pajarinen

TL;DR

Continuous Monte Carlo Graph Search (CMCGS), an extension of MCTS to online planning in environments with continuous state and action spaces, takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance.

Abstract

Online planning is crucial for high performance in many complex sequential decision-making tasks. Monte Carlo Tree Search (MCTS) employs a principled mechanism for trading off exploration and exploitation for efficient online planning, and it outperforms comparison methods in many discrete decision-making domains such as Go, Chess, and Shogi. Subsequently, extensions of MCTS to continuous domains have been developed. However, the inherently high branching factor and the resulting explosion of the search tree size limit existing methods. To address this problem, we propose Continuous Monte Carlo Graph Search (CMCGS), an extension of MCTS to online planning in environments with continuous state and action spaces. CMCGS takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance. To implement this idea, at each time step, CMCGS clusters similar states into a limited number of stochastic action bandit nodes, which form a layered directed graph instead of an MCTS search tree. Experimental evaluation shows that, with limited sample budgets, CMCGS outperforms comparable planning methods in several complex continuous DeepMind Control Suite benchmarks and in 2D navigation and exploration tasks. Furthermore, CMCGS can be scaled up through parallelization, and it outperforms the Cross-Entropy Method (CEM) in continuous control with learned dynamics models.
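
The planning loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 1-D environment (`step`, `GOAL`), the nearest-state-mean routing in `select_node`, the elite-based Gaussian refit in `Node.update`, and all hyperparameters are hypothetical. The re-clustering step that splits nodes (step e in Figure 1) is omitted, so each layer holds a single node and every state reached at a given depth shares that node's action policy.

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL = 3.0  # hypothetical toy target, not from the paper


def step(state, action):
    """Toy deterministic dynamics: move by the clipped action; reward is -distance to GOAL."""
    next_state = state + float(np.clip(action, -1.0, 1.0))
    return next_state, -abs(next_state - GOAL)


class Node:
    """A graph node: one Gaussian action policy shared by all states routed to it."""

    def __init__(self, state_mean):
        self.state_mean = state_mean
        self.action_mean, self.action_std = 0.0, 1.0
        self.buffer = []  # replay memory of (state, action, return)

    def sample_action(self):
        return float(np.clip(rng.normal(self.action_mean, self.action_std), -1.0, 1.0))

    def update(self, state, action, ret, elite_frac=0.3):
        """Store a sample, then refit the state mean and the policy (assumed elite refit)."""
        self.buffer.append((state, action, ret))
        states, actions, returns = map(np.array, zip(*self.buffer))
        self.state_mean = float(states.mean())
        if len(self.buffer) >= 5:
            n_elite = max(2, int(elite_frac * len(self.buffer)))
            elite = actions[np.argsort(returns)[-n_elite:]]
            self.action_mean = float(elite.mean())
            self.action_std = float(elite.std() + 0.05)  # floor keeps some exploration


def select_node(layer, state):
    """Route a state to the node with the closest state mean (assumed routing rule)."""
    return min(layer, key=lambda node: abs(node.state_mean - state))


def plan(root_state, n_iters=300, max_depth=4, expand_after=10, rollout_len=3):
    layers = [[Node(root_state)]]  # layered graph: one list of nodes per depth
    for _ in range(n_iters):
        state, path, rewards = root_state, [], []
        # a) Selection: walk the layers, sampling an action from the chosen node.
        for layer in layers:
            node = select_node(layer, state)
            action = node.sample_action()
            path.append((node, state, action))
            state, reward = step(state, action)
            rewards.append(reward)
        # b) Expansion: add a layer once the last node has enough experience.
        if len(layers) < max_depth and len(path[-1][0].buffer) >= expand_after:
            layers.append([Node(state)])
        # c) Simulation: random rollout from the sink state.
        ret = 0.0
        for _ in range(rollout_len):
            state, reward = step(state, rng.uniform(-1.0, 1.0))
            ret += reward
        # d) Backup: accumulate the return and update each visited node.
        for (node, s, a), r in zip(reversed(path), reversed(rewards)):
            ret += r
            node.update(s, a, ret)
    return layers[0][0].action_mean, layers


first_action, graph = plan(root_state=0.0)
print(f"first action: {first_action:.2f}, layers: {len(graph)}")
```

With more than one node per layer, `select_node` is what would let distinct but similar states reuse one policy, while dissimilar states at the same depth would get their own.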

Paper Structure

This paper contains 27 sections, 4 equations, 7 figures, 28 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Core steps in one iteration of Continuous Monte Carlo Graph Search (CMCGS). a) Starting from the root node, the graph is navigated via action sampling and node selection until a sink node is reached. b) If there is enough experience collected in the final layer of the graph and the maximum depth has not been reached, a new layer containing a new node N is initialized. c) A trajectory of random actions is simulated from the graph's sink node to approximate the value of the state. d) The computed accumulated reward is backed up through the selected nodes, updating their replay memories, policies, and state distributions. e) If a new cluster of experience data is found in a previous layer of the graph, all nodes in that layer are updated based on the new clustering information (in this example, the node N is split into two new nodes N1 and N2).
  • Figure 2: An illustration of exploration in the toy environment. For clarity, the actions have been clipped to be in the range $[-1, 1]$. The agent, the chosen action, the explored trajectories, and the different-sized rewards are represented by the blue dot, red arrow, grey dashed lines, and green dots, respectively. The search nodes of CMCGS and the corresponding state and action means are illustrated with black dots and arrows. The left image illustrates how CMCGS explores with state-dependent policies. In the other pictures, we see how CEM fails in the environment. CEM can either fail to discover the large reward and choose a suboptimal action (center) or completely fail to handle the multimodality required by the environment (right).
  • Figure 3: Custom environments used in the experiments.
  • Figure 4: The mean episode returns averaged over the custom 2D environments that test exploration (left) and the environments from DeepMind Control Suite (right).
  • Figure 5: Reward plots for different simulation budgets per timestep. 2d-X environments (first row) have been developed by the authors to test exploration. The rest are from DeepMind Control Suite with proprioceptive observations. The proposed CMCGS shows the best performance overall.
  • ...and 2 more figures
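
For readers unfamiliar with the CEM baseline discussed in the Figure 2 caption, a generic Cross-Entropy Method planner can be sketched as below. This is a textbook-style sketch under stated assumptions, not the configuration used in the paper: the function name `cem_plan`, the toy objective `toy_return`, the clipping of actions to $[-1, 1]$, and all hyperparameters are hypothetical.

```python
import numpy as np


def cem_plan(rollout_return, horizon, action_dim,
             n_iters=5, pop_size=64, n_elite=8, seed=0):
    """Generic CEM: refit one diagonal Gaussian per time step to the elite action sequences."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences and clip them to the assumed action bounds.
        samples = rng.normal(mean, std, size=(pop_size, horizon, action_dim))
        samples = np.clip(samples, -1.0, 1.0)
        # Score each sequence with the user-supplied return function.
        returns = np.array([rollout_return(seq) for seq in samples])
        # Keep the highest-return sequences and refit the sampling distribution.
        elite = samples[np.argsort(returns)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan


def toy_return(action_seq, goal=np.array([2.0, 0.0])):
    """Hypothetical objective: a point agent starts at the origin and sums its actions."""
    final_position = action_seq.sum(axis=0)
    return -float(np.linalg.norm(final_position - goal))


action = cem_plan(toy_return, horizon=3, action_dim=2)
print(action)
```

Because the sampling distribution here is a single Gaussian per time step, the action it proposes does not depend on which state a sampled trajectory has reached, and it cannot place mass on two separated action modes at once. This is consistent with the failure modes the Figure 2 caption attributes to CEM.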