Doubly Robust Monte Carlo Tree Search

Manqing Liu, Andrew L. Beam

TL;DR

The paper addresses sample inefficiency in planning for complex, partially observable environments by integrating Doubly Robust (DR) off-policy evaluation into Monte Carlo Tree Search (MCTS). The proposed DR-MCTS forms a hybrid estimator $V_{\text{hybrid}}(h) = \beta V_{\text{MCTS}}(h) + (1-\beta) V_{\text{DR}}(h)$, with a softmax-derived target policy $\pi_e$ and cross-validated estimates $\hat{V}$ and $\hat{Q}$. The authors prove that the hybrid estimator is unbiased, give a condition under which it reduces variance, and show improved sample efficiency in Tic-Tac-Toe and VirtualHome tasks. They further show that DR-MCTS can achieve strong planning performance with smaller language models, a practical benefit for resource-constrained decision-making.
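
The hybrid estimator in the TL;DR can be illustrated with a minimal Python sketch. Only the mixing formula and the softmax-derived target policy come from the text above; the step-wise DR recursion used here is the standard form from the off-policy evaluation literature and is an assumption, since this page does not state the paper's exact $V_{\text{DR}}$. The function names, the `temperature` parameter, and the trajectory dictionary keys are all hypothetical.

```python
import math


def softmax_policy(q_values, temperature=1.0):
    """Softmax distribution over actions from estimated Q-values (hypothetical helper)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def dr_estimate(trajectory, gamma=1.0):
    """Step-wise doubly robust estimate along one trajectory (assumed standard form).

    Each step is a dict with keys:
      reward : observed reward
      rho    : importance ratio pi_e(a|h) / pi_b(a|h)
      v_hat  : model estimate of V at this step
      q_hat  : model estimate of Q for the action taken
    """
    v_dr = 0.0
    # Work backward so each step can use the DR estimate of the remainder.
    for step in reversed(trajectory):
        correction = step["reward"] + gamma * v_dr - step["q_hat"]
        v_dr = step["v_hat"] + step["rho"] * correction
    return v_dr


def hybrid_estimate(v_mcts, v_dr, beta):
    """V_hybrid = beta * V_MCTS + (1 - beta) * V_DR, with beta in [0, 1]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * v_mcts + (1.0 - beta) * v_dr


if __name__ == "__main__":
    pi_e = softmax_policy([0.2, 0.5, 0.1])
    trajectory = [
        {"reward": 0.0, "rho": 1.2, "v_hat": 0.4, "q_hat": 0.5},
        {"reward": 1.0, "rho": 0.9, "v_hat": 0.8, "q_hat": 0.9},
    ]
    v_dr = dr_estimate(trajectory)
    print(pi_e, hybrid_estimate(v_mcts=0.6, v_dr=v_dr, beta=0.5))
```

When the value models are exact, the correction term is zero in expectation, so the importance ratio has little to amplify; this is the usual intuition for the lower variance of DR estimators compared with plain importance sampling.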

Abstract

We present Doubly Robust Monte Carlo Tree Search (DR-MCTS), a novel algorithm that integrates Doubly Robust (DR) off-policy estimation into Monte Carlo Tree Search (MCTS) to enhance sample efficiency and decision quality in complex environments. Our approach introduces a hybrid estimator that combines MCTS rollouts with DR estimation, offering theoretical guarantees of unbiasedness and variance reduction under specified conditions. Empirical evaluations in Tic-Tac-Toe and the partially observable VirtualHome environment demonstrate DR-MCTS's superior performance over standard MCTS. In Tic-Tac-Toe, DR-MCTS achieves an 88% win rate compared to a 10% win rate for standard MCTS. In compound VirtualHome tasks, DR-MCTS attains a 20.7% success rate versus 10.3% for standard MCTS. Our scaling analysis shows that DR-MCTS is more sample-efficient: with a smaller language model, it outperforms standard MCTS run with larger ones. These results underscore DR-MCTS's potential for efficient decision-making in complex, real-world scenarios where sample efficiency is paramount.

Paper Structure

This paper contains 23 sections, 4 theorems, 30 equations, 3 figures, 2 tables, 2 algorithms.

Key Result

Theorem 2.1

The hybrid estimator is unbiased for estimating the value of the target policy $\pi_e$.
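
A short sketch of why a result like this can hold, under assumptions not stated on this page: suppose $V_{\text{MCTS}}(h)$ and $V_{\text{DR}}(h)$ are each unbiased for the target-policy value, written here as $V^{\pi_e}(h)$, and that $\beta \in [0,1]$ is a fixed constant that does not depend on the sampled data. The paper's own proof and its exact conditions may differ. By linearity of expectation,

$$
\begin{aligned}
\mathbb{E}\big[V_{\text{hybrid}}(h)\big]
&= \beta\,\mathbb{E}\big[V_{\text{MCTS}}(h)\big] + (1-\beta)\,\mathbb{E}\big[V_{\text{DR}}(h)\big] \\
&= \beta\,V^{\pi_e}(h) + (1-\beta)\,V^{\pi_e}(h) \\
&= V^{\pi_e}(h).
\end{aligned}
$$

The argument uses only that the two weights sum to one, so under these assumptions the choice of $\beta$ affects the variance of the estimator but not its bias.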

Figures (3)

  • Figure 1: Win rates for Tic-Tac-Toe across number of rollouts. Top: DR-MCTS vs. MCTS. Bottom: IS-MCTS vs. MCTS.
  • Figure 2: Success rates for VirtualHome tasks across different algorithms and model sizes.
  • Figure 3: Scaling analysis for different task types in VirtualHome. Top: Novel Objects tasks. Middle: Novel Comp. tasks. Bottom: Novel Comp. + Objects tasks.

Theorems & Definitions (6)

  • Theorem 2.1: Unbiasedness of Hybrid Estimator
  • Theorem 2.2: Variance Reduction Condition for Hybrid Estimator
  • Theorem 1.1: Unbiasedness of Hybrid Estimator
  • proof
  • Theorem 1.2: Variance Reduction Condition for Hybrid Estimator
  • proof