Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
Bingning Huang, Tu Nguyen, Matthieu Zimmer
TL;DR
Tree-OPO tackles the challenge of learning multi-step reasoning in LLMs by leveraging offline MCTS-derived prefixes as a structured, off-policy curriculum to train policies with GRPO. It introduces Staged Advantage Estimation (SAE), a constrained projection that enforces tree-based ordering of prefix advantages to reduce gradient variance and improve learning stability. The approach combines offline teacher prefixes with online completions, yielding a gradient signal that respects the hierarchical structure of reasoning trees and improves final accuracy on GSM8K and related math datasets. Theoretical analysis establishes variance reduction guarantees for SAE, and empirical results show competitive gains with efficient use of resources compared to standard GRPO and KL-based distillation methods.
Abstract
Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in verifier-guided reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables consistent policy learning from group-relative judgments. We reframe GRPO into a staged training paradigm, leveraging a teacher's MCTS rollouts to construct a tree-structured curriculum of prefixes. This introduces the novel challenge of computing advantages for training samples that originate from different prefixes, each with a distinct expected return. To address this, we propose Staged Advantage Estimation (SAE), a framework for computing low-variance, prefix-aware advantages by projecting rewards onto a constraint set that respects the tree's hierarchy. Our empirical results on mathematical reasoning tasks show that SAE improves final accuracy over standard GRPO. This outcome is grounded in our theoretical analysis, which confirms that SAE reduces gradient variance, a principled path to improved sample efficiency. We demonstrate this through practical SAE implementations, comparing efficient heuristics against a formal quadratic program.
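To illustrate why samples drawn from different prefixes need different baselines, the following minimal sketch contrasts a plain group-relative advantage with one simple prefix-aware variant. It is not the paper's SAE projection or its quadratic program. The function names, the use of the per-prefix empirical mean as a baseline, and the toy data are all assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: center and scale rewards over the whole group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prefix_baselined_advantages(prefix_ids, rewards, eps=1e-8):
    """Hypothetical prefix-aware variant (an assumption, not the paper's SAE):
    subtract the mean reward of completions sharing the same prefix, then
    rescale by the standard deviation of the centered values."""
    by_prefix = defaultdict(list)
    for p, r in zip(prefix_ids, rewards):
        by_prefix[p].append(r)
    baseline = {p: mean(rs) for p, rs in by_prefix.items()}
    centered = [r - baseline[p] for p, r in zip(prefix_ids, rewards)]
    sigma = pstdev(centered)  # centered values already have zero mean
    return [c / (sigma + eps) for c in centered]


# Toy group: two completions from a shallow prefix, two from a deep prefix
# that is already close to the answer (hypothetical data).
prefix_ids = ["shallow", "shallow", "deep", "deep"]
rewards = [0.0, 1.0, 1.0, 1.0]

plain = group_relative_advantages(rewards)
aware = prefix_baselined_advantages(prefix_ids, rewards)
print("group-relative:", [round(a, 3) for a in plain])
print("prefix-aware:  ", [round(a, 3) for a in aware])
```

In this toy group, the plain group-relative estimate gives the completions from the deep prefix a positive advantage even though success from that prefix was already expected. The prefix-aware variant assigns them zero and puts the learning signal on the shallow prefix, where the outcome actually varied.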
