C-MCTS: Safe Planning with Monte Carlo Tree Search

Dinesh Parthasarathy; Georgios Kontes; Axel Plinge; Christopher Mutschler

TL;DR

This work proposes Constrained MCTS (C-MCTS), which estimates cost using a safety critic that is trained with Temporal Difference learning in an offline phase prior to agent deployment, and is less susceptible to cost violations than previous work.

Abstract

The Constrained Markov Decision Process (CMDP) formulation makes it possible to solve safety-critical decision-making tasks that are subject to constraints. While CMDPs have been extensively studied in the Reinforcement Learning literature, little attention has been given to sampling-based planning algorithms such as MCTS for solving them. Previous approaches behave conservatively with respect to costs, because they avoid constraint violations by relying on Monte Carlo cost estimates that suffer from high variance. We propose Constrained MCTS (C-MCTS), which estimates cost using a safety critic that is trained with Temporal Difference learning in an offline phase prior to agent deployment. The critic limits exploration by pruning unsafe trajectories within MCTS during deployment. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards than previous work. As a byproduct, the planner needs fewer planning steps. Most importantly, under model mismatch between the planner and the real world, C-MCTS is less susceptible to cost violations than previous work.
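The two ideas in the abstract, a cost critic learned offline with Temporal Difference updates and pruning of actions the critic flags as unsafe, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the five-state chain, the tabular critic, the uniformly random data-collection policy, and the names `train_cost_critic` and `safe_actions` are hypothetical. The tree search itself is left out, and the paper's section list indicates that it uses an ensemble of critics rather than a table.

```python
import random

# Hypothetical toy CMDP (not from the paper): states 0..4 on a line.
# Action 0 moves left, action 1 moves right. Entering state 4 costs 1
# and ends the episode; entering state 0 ends it at no cost.
N_STATES = 5
ACTIONS = (0, 1)
GAMMA = 1.0


def step(state, action):
    """Return (next_state, cost, done) for the toy chain."""
    nxt = state + (1 if action == 1 else -1)
    cost = 1.0 if nxt == N_STATES - 1 else 0.0
    done = nxt in (0, N_STATES - 1)
    return nxt, cost, done


def train_cost_critic(episodes=5000, alpha=0.05, seed=0):
    """Offline phase: SARSA-style TD(0) estimate of the expected cumulative
    cost Q_C(s, a) under a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q_cost = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = rng.choice([1, 2, 3])
        action = rng.choice(ACTIONS)
        done = False
        while not done:
            nxt, cost, done = step(state, action)
            nxt_action = rng.choice(ACTIONS)
            # Bootstrap from the next state-action pair unless terminal.
            target = cost if done else cost + GAMMA * q_cost[(nxt, nxt_action)]
            q_cost[(state, action)] += alpha * (target - q_cost[(state, action)])
            state, action = nxt, nxt_action
    return q_cost


def safe_actions(q_cost, state, budget):
    """Deployment phase: keep only actions whose predicted cost fits the budget."""
    return [a for a in ACTIONS if q_cost[(state, a)] <= budget]


critic = train_cost_critic()
print(safe_actions(critic, state=3, budget=0.8))
```

In a planner, a filter like `safe_actions` would be applied when expanding a node, so that branches the critic predicts to exceed the cost budget are never explored.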

Paper Structure (18 sections, 3 theorems, 5 equations, 6 figures, 3 tables, 1 algorithm)

Introduction
Monte Carlo Tree Search for Constrained MDPs
Constrained Monte Carlo Tree Search (C-MCTS)
Evaluation
Conclusion
Methodology
Guided Bootstrapping of the Safety Critic Ensemble
Considerations on the Reliability of the Safety Critic
Details on the Experimental Setup
Environments
Rocksample
Safe Gridworld
Training Details & Compute
Additional Experimental Results
Planning cost to achieve high rewards: C-MCTS vs CC-MCP
...and 3 more sections

Key Result

Proposition 1

This iterative optimization process converges asymptotically to the optimal $\lambda^{*}$ in the $k$-th MDP.
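The excerpt does not spell out the iterative process behind this proposition. Assuming it is a Lagrange-multiplier update of the kind commonly used for CMDPs, the sketch below shows how such an iteration can settle on a multiplier. The two candidate policies, their reward and cost values, the cost budget, and the $1/t$ step size are all hypothetical and not taken from the paper.

```python
# Hypothetical candidates: name -> (expected reward, expected cost).
POLICIES = {"safe": (1.0, 0.0), "risky": (2.0, 1.0)}
COST_BUDGET = 0.5  # assumed constraint on expected cost


def best_response(lam):
    """Pick the policy that maximises reward minus lam times cost."""
    return max(POLICIES, key=lambda name: POLICIES[name][0] - lam * POLICIES[name][1])


def dual_ascent(iterations=1000):
    """Raise lam when the chosen policy exceeds the budget, lower it otherwise,
    with a diminishing step size and projection onto lam >= 0."""
    lam = 0.0
    for t in range(1, iterations + 1):
        _, cost = POLICIES[best_response(lam)]
        lam = max(0.0, lam + (1.0 / t) * (cost - COST_BUDGET))
    return lam


print(dual_ascent())
```

In this toy problem the multiplier oscillates around 1 with shrinking amplitude. That is the value at which both policies score equally, so a mixture of the two can meet the budget exactly.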

Figures (6)

Figure 1: Simplified flow of training phase in C-MCTS.
Figure 2: Performance of C-MCTS, MCTS, and CC-MCP on different Rocksample configurations evaluated on 100 episodes. The shaded region represents the standard deviation over all episodes.
Figure 3: Comparing safety for different training/deployment strategies, i.e., using different planning horizons during training (left), deploying with different ensemble thresholds (middle), and collecting training samples from simulators of different accuracies (right).
Figure 4: Environments: (left) exemplary Rocksample$(7,8)$ environment, i.e., a $7 \times 7$ Rocksample environment with $8$ rocks randomly placed; (right) exemplary Safe Gridworld environment, where the colors denote start cells (yellow), the goal cell (green), unsafe cells (pink), and windy cells (blue).
Figure 5: Maximum depth of the search tree for C-MCTS, MCTS, and CC-MCP on different Rocksample configurations, averaged over 100 episodes.
...and 1 more figures

Theorems & Definitions (6)

Definition 1
Definition 2
Definition 3
Proposition 1
Proposition 2
Corollary 1
