Doubly Robust Monte Carlo Tree Search
Manqing Liu, Andrew L. Beam
TL;DR
We address sample inefficiency in planning for complex, partially observable environments by integrating Doubly Robust (DR) off-policy evaluation into Monte Carlo Tree Search (MCTS). The proposed DR-MCTS forms a hybrid estimator V_hybrid(h) = beta V_MCTS(h) + (1-beta) V_DR(h) with a softmax-derived target policy pi_e and cross-validated estimates hat V and hat Q. The authors establish unbiasedness of the hybrid estimator and a variance-reduction guarantee under specified conditions, and show improved sample efficiency in Tic-Tac-Toe and VirtualHome tasks. They further demonstrate that DR-MCTS can achieve strong planning performance using smaller language models, highlighting practical benefits for resource-constrained decision-making.
Abstract
We present Doubly Robust Monte Carlo Tree Search (DR-MCTS), a novel algorithm that integrates Doubly Robust (DR) off-policy estimation into Monte Carlo Tree Search (MCTS) to enhance sample efficiency and decision quality in complex environments. Our approach introduces a hybrid estimator that combines MCTS rollouts with DR estimation, offering theoretical guarantees of unbiasedness and variance reduction under specified conditions. Empirical evaluations in Tic-Tac-Toe and the partially observable VirtualHome environment demonstrate DR-MCTS's superior performance over standard MCTS. In Tic-Tac-Toe, DR-MCTS achieves an 88% win rate compared to a 10% win rate for standard MCTS. In compound VirtualHome tasks, DR-MCTS attains a 20.7% success rate versus 10.3% for standard MCTS. Our scaling analysis reveals that DR-MCTS exhibits better sample efficiency: notably, DR-MCTS with a smaller language model outperforms standard MCTS with a larger one. These results underscore DR-MCTS's potential for efficient decision-making in complex, real-world scenarios where sample efficiency is paramount.
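To make the hybrid estimator concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard recursive form of the doubly robust estimator from the off-policy evaluation literature, which may differ in detail from the estimator defined later in the paper. The function names (`softmax_policy`, `dr_estimate`, `hybrid_estimate`), the trajectory layout, and the callables `v_hat`, `q_hat`, `pi_e`, and `pi_b` are all hypothetical and chosen only for illustration.

```python
import math


def softmax_policy(q_values, temperature=1.0):
    """Illustrative target policy pi_e: a softmax over estimated action values.

    q_values maps each action to its estimated value (assumed layout).
    """
    m = max(q_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def dr_estimate(trajectory, v_hat, q_hat, pi_e, pi_b, gamma=1.0):
    """Recursive doubly robust estimate for a single rollout.

    trajectory: list of (state, action, reward) tuples, in time order.
    v_hat(state), q_hat(state, action): learned value estimates.
    pi_e(state, action), pi_b(state, action): target and behavior
    policy probabilities. All of these are assumptions of this sketch.
    """
    v_dr = 0.0
    # Work backwards from the end of the rollout.
    for state, action, reward in reversed(trajectory):
        rho = pi_e(state, action) / pi_b(state, action)  # importance ratio
        # Model baseline plus an importance-weighted correction term.
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr


def hybrid_estimate(v_mcts, v_dr, beta):
    """V_hybrid = beta * V_MCTS + (1 - beta) * V_DR."""
    return beta * v_mcts + (1.0 - beta) * v_dr


# Usage: a two-step rollout where target and behavior policies coincide.
rollout = [("s0", "a", 1.0), ("s1", "a", 2.0)]
zero_v = lambda s: 0.0
zero_q = lambda s, a: 0.0
uniform = lambda s, a: 0.5
v_dr = dr_estimate(rollout, zero_v, zero_q, uniform, uniform, gamma=0.5)
v_hybrid = hybrid_estimate(v_mcts=1.0, v_dr=v_dr, beta=0.25)
```

In this usage example the value models are identically zero and the two policies agree, so the DR estimate reduces to the ordinary discounted return of the rollout. With nonzero `v_hat` and `q_hat`, the correction term only has to account for the model's error, which is the intuition behind the variance reduction the abstract refers to.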
