Parallelizing Tree Search with Twice Sequential Monte Carlo

Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Böhmer

TL;DR

TSMCTS addresses two core bottlenecks of SMC-based planning in RL, estimator variance that grows with search depth and path degeneracy at the root, by reformulating SMC for root-value estimation (SMCTS) and combining it with Sequential Halving (SH) at the root. The method preserves SMC's parallelizable, memory-efficient nature while achieving lower estimator variance and better scaling with sequential compute. Empirically, TSMCTS outperforms SMC baselines and a modern MCTS variant across discrete and continuous domains, and ablations support both the variance reduction and the mitigation of path degeneracy. By integrating SH with SMCTS, the work extends model-based RL planning with a scalable, GPU-friendly planner.
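
To make the Sequential Halving component concrete, here is a minimal sketch of the generic bandit procedure: split a fixed sampling budget over roughly $\log_2 n$ rounds, sample every surviving arm equally, and discard the worse half after each round. It is not the paper's implementation. The function name `sequential_halving`, the callable-arm interface, and the choice to keep cumulative statistics across rounds are all assumptions made for illustration.

```python
import math
import random


def sequential_halving(arms, budget, rng):
    """Return the index of the arm that survives Sequential Halving.

    arms   : non-empty list of callables; arms[i](rng) returns one noisy reward
             (hypothetical interface, chosen for this sketch)
    budget : total number of samples to spread across all rounds
    """
    survivors = list(range(len(arms)))
    rounds = max(1, math.ceil(math.log2(len(arms))))
    totals = [0.0] * len(arms)
    counts = [0] * len(arms)
    for _round in range(rounds):
        if len(survivors) == 1:
            break
        # Split this round's share of the budget evenly among survivors.
        per_arm = max(1, budget // (rounds * len(survivors)))
        for i in survivors:
            for _ in range(per_arm):
                totals[i] += arms[i](rng)
                counts[i] += 1
        # Keep the better half (rounded up) by cumulative empirical mean.
        # Assumption: statistics carry over between rounds.
        survivors.sort(key=lambda i: totals[i] / counts[i], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]


# Usage: four Gaussian arms with made-up means; arm 2 has the highest mean.
means = [0.1, 0.2, 0.9, 0.3]
arms = [lambda rng, m=m: rng.gauss(m, 0.1) for m in means]
best = sequential_halving(arms, budget=400, rng=random.Random(0))
```

In a planner, each "arm" would correspond to a root action and each sample to a search-based value estimate for that action. How TSMCTS interleaves the two is specified in the paper, not here.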

Abstract

Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS. Through variance reduction and mitigation of path degeneracy, TSMCTS scales favorably with sequential compute while retaining the properties that make SMC natural to parallelize.
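
The path degeneracy mentioned above can be illustrated with a toy simulation. Each particle remembers which root action it descends from, and repeated multinomial resampling gradually collapses the population onto a few root ancestors. This is a sketch under stated assumptions: the weights are uniform random placeholders rather than the importance weights a real SMC planner would compute, and `distinct_root_ancestors` is a hypothetical helper name.

```python
import random


def distinct_root_ancestors(num_particles, depth, rng):
    """Count distinct surviving root ancestors after each resampling step."""
    # Particle i starts out as the sole descendant of root action i.
    ancestors = list(range(num_particles))
    history = [len(set(ancestors))]
    for _ in range(depth):
        # Placeholder weights (assumption); a planner would derive these
        # from rewards and value estimates.
        weights = [rng.random() for _ in range(num_particles)]
        # Multinomial resampling: each child inherits its parent's root ancestor.
        ancestors = rng.choices(ancestors, weights=weights, k=num_particles)
        history.append(len(set(ancestors)))
    return history


history = distinct_root_ancestors(num_particles=16, depth=32, rng=random.Random(0))
```

Because resampling only ever copies existing ancestors, the counts in `history` can never increase with depth. Deeper searches therefore leave fewer root actions represented.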

Paper Structure

This paper contains 45 sections, 2 theorems, 23 equations, 4 figures, 6 tables, 4 algorithms.

Key Result

Theorem 1

For any improvement operator $\mathcal{I}$, search horizon $T$, prior policy $\pi_\theta$, and evaluation $Q^{\pi_{\theta}}$, RL-SMC with infinitely many particles is a policy improvement operator.
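
For readers unfamiliar with the term, one common formalization of "policy improvement operator" is given below. It is an assumption about the intended meaning and may differ in detail from the paper's own definition. An operator that maps a policy $\pi$ and its evaluation $Q^{\pi}$ to a new policy $\pi'$ is a policy improvement operator if

$$\sum_{a} \pi'(a \mid s)\, Q^{\pi}(s, a) \;\ge\; \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a) \quad \text{for all states } s.$$

By the classical policy improvement theorem, this condition implies $V^{\pi'}(s) \ge V^{\pi}(s)$ for all $s$.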

Figures (4)

  • Figure 1: Averaged returns vs. environment interactions. Mean and 90% two-sided BCa-bootstrap intervals (Efron, 1987) across 20 seeds.
  • Figure 2: Averaged returns vs. runtime (seconds). Mean and 90% two-sided BCa-bootstrap intervals across 20 seeds.
  • Figure 3: Left: Performance scaling with depth (higher is better), averaged across environments and particle budgets of $4, 8, 16$; 90% two-sided BCa-bootstrap intervals across 10 seeds. Center: Variance of the root estimator vs. depth (lower is better). Right: The number of actions active in the policy target (a constant count, i.e., no target degeneracy, is better). Center and right are averaged across states and particle budgets of $4, 8, 16$ in Snake. Mean and $\pm 2$ SEM ($\approx 95\%$ Gaussian CI) across 5 seeds.
  • Figure 4: Performance scaling with depth (higher is better, increasing is better). Averaged across environments and particle budgets of $4, 8, 16$ and normalized across environments. Mean and 90% two-sided BCa-bootstrap intervals across 10 seeds.

Theorems & Definitions (3)

  • Theorem 1
  • Corollary 1
  • proof