Gaussian Process Aggregation for Root-Parallel Monte Carlo Tree Search with Continuous Actions

Junlin Xiao, Victor-Alexandru Darvariu, Bruno Lacerda, Nick Hawes

TL;DR

The paper tackles the challenge of aggregating results in root-parallel MCTS for continuous action spaces. It introduces GPR2P, a Gaussian Process Regression-based aggregation that interpolates returns over the entire action space and uses the predictive mean to select actions. The GP is fitted with an RBF kernel, which lets it generalize to unseen actions, and a reliability threshold determines which sampled actions are retained for the fit. Across six diverse environments, including both deterministic and stochastic transitions, GPR2P consistently outperforms prior aggregation strategies, particularly at low trial budgets, with a modest inference-time overhead. The work demonstrates the practical impact of principled interpolation in online planning for continuous domains and suggests avenues for integrating GP guidance with per-thread decision making.

Abstract

Monte Carlo Tree Search is a cornerstone algorithm for online planning, and its root-parallel variant is widely used when wall clock time is limited but best performance is desired. In environments with continuous action spaces, how to best aggregate statistics from different threads is an important yet underexplored question. In this work, we introduce a method that uses Gaussian Process Regression to obtain value estimates for promising actions that were not trialed in the environment. We perform a systematic evaluation across 6 different domains, demonstrating that our approach outperforms existing aggregation strategies while requiring a modest increase in inference time.
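
To make the aggregation idea concrete, here is a minimal sketch in Python (NumPy only) of GP-based aggregation over root actions pooled from several threads. It is an illustration under stated assumptions, not the authors' implementation: the function names `rbf_kernel` and `gp_aggregate`, the visit-count filter `min_visits` (one possible reading of the reliability threshold), the fixed `length_scale` and `noise` values, and the dense candidate grid are all hypothetical choices made for this example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """RBF (squared-exponential) kernel between rows of a (n, d) and b (m, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)


def gp_aggregate(actions, returns, visits, candidates,
                 min_visits=2, length_scale=0.5, noise=1e-2):
    """Pick the candidate action with the highest GP predictive mean.

    actions, returns, visits: root statistics pooled from all threads.
    candidates: actions to score, which may include unsampled actions.
    min_visits: hypothetical reliability filter (an assumption of this sketch).
    """
    actions = np.asarray(actions, dtype=float).reshape(len(actions), -1)
    candidates = np.asarray(candidates, dtype=float).reshape(len(candidates), -1)
    returns = np.asarray(returns, dtype=float)

    # Keep only root actions whose estimates are backed by enough visits.
    keep = np.asarray(visits) >= min_visits
    X, y = actions[keep], returns[keep]

    # Standard GP regression with a constant prior mean equal to the data mean.
    y_mean = y.mean()
    K = rbf_kernel(X, X, length_scale) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y - y_mean)
    mean = rbf_kernel(candidates, X, length_scale) @ alpha + y_mean

    best = int(np.argmax(mean))
    return candidates[best], mean


# Toy 1-D example: true return peaks at a = 0.3, which no thread sampled.
sampled_actions = [-0.8, -0.2, 0.1, 0.5, 0.9]
sampled_returns = [5.0, -0.25, -0.04, -0.04, -0.36]  # first entry is a noisy outlier
sampled_visits = [1, 5, 5, 5, 5]                     # the outlier has a single visit
grid = np.linspace(-1.0, 1.0, 201)

best_action, predicted = gp_aggregate(
    sampled_actions, sampled_returns, sampled_visits, grid
)
print("selected action:", best_action)
```

In this toy setup the single-visit outlier is dropped before fitting, and the selected action can lie between sampled actions, which is the behavior the summary attributes to GPR2P. A practical version would presumably tune the kernel hyperparameters rather than fix them.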

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 3 tables, 3 algorithms.

Figures (4)

  • Figure 1: Illustration of the GPR2P method, which uses Gaussian Process Regression to perform aggregation in root-parallel MCTS. Unlike existing methods, GPR2P can estimate the return for and select actions that were not sampled in the tree.
  • Figure 2: Illustrations of the environments considered in our evaluation. This includes Gymnasium environments (top row) and simplified self-designed environments with stochastic transitions (bottom row). All environments use continuous action spaces.
  • Figure 3: Results obtained by root-parallel MCTS aggregation strategies across all environments. GPR2P performs best overall, followed by Similarity Merge. As expected, the differences diminish as the number of trials increases.
  • Figure 4: Performance comparison of GPR2P versus Similarity Merge in which the GPR2P inference time is used to run additional trials. This does not lead to a significant change in the results, highlighting the value of the GPR2P aggregation.