Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning

Bingning Huang, Tu Nguyen, Matthieu Zimmer

TL;DR

Tree-OPO tackles the challenge of learning multi-step reasoning in LLMs by leveraging offline MCTS-derived prefixes as a structured, off-policy curriculum to train policies with GRPO. It introduces Staged Advantage Estimation (SAE), a constrained projection that enforces tree-based ordering of prefix advantages to reduce gradient variance and improve learning stability. The approach combines offline teacher prefixes with online completions, yielding a gradient signal that respects the hierarchical structure of reasoning trees and improves final accuracy on GSM8K and related math datasets. Theoretical analysis establishes variance reduction guarantees for SAE, and empirical results show competitive gains with efficient use of resources compared to standard GRPO and KL-based distillation methods.

Abstract

Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in verifier-guided reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables consistent policy learning from group-relative judgments. We reframe GRPO into a staged training paradigm, leveraging a teacher's MCTS rollouts to construct a tree-structured curriculum of prefixes. This introduces the novel challenge of computing advantages for training samples that originate from different prefixes, each with a distinct expected return. To address this, we propose Staged Advantage Estimation (SAE), a framework for computing low-variance, prefix-aware advantages by projecting rewards onto a constraint set that respects the tree's hierarchy. Our empirical results on mathematical reasoning tasks show that SAE improves final accuracy over standard GRPO. This outcome is grounded in our theoretical analysis, which confirms that SAE reduces gradient variance, a principled path to improved sample efficiency. We demonstrate this through practical SAE implementations, comparing efficient heuristics against a formal quadratic program.
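To make the "different prefixes, distinct expected return" issue concrete, the following is a minimal sketch, not the authors' SAE procedure. It contrasts flat group standardization, as in GRPO, with a simple prefix-aware advantage of the form $a_k = r_k - \alpha V(p_k)$ that appears in Lemma 3.1. The function names, the prefix labels, and the per-prefix values in `value_of_prefix` are hypothetical and chosen only for illustration.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Flat group standardization: every sample shares one mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def prefix_aware_advantages(rewards, prefix_ids, value_of_prefix, alpha=1.0):
    """Subtract a per-prefix baseline: a_k = r_k - alpha * V(p_k).

    value_of_prefix is an assumed lookup table of expected returns per
    prefix; how such values are obtained is outside this sketch.
    """
    r = np.asarray(rewards, dtype=float)
    v = np.array([value_of_prefix[p] for p in prefix_ids], dtype=float)
    return r - alpha * v

# Hypothetical group: two completions from a shallow (hard) prefix and
# two from a deep (easy) prefix, with binary verifier rewards.
rewards = [1.0, 0.0, 1.0, 1.0]
prefix_ids = ["shallow", "shallow", "deep", "deep"]
value_of_prefix = {"shallow": 0.25, "deep": 0.75}  # assumed values

flat = grpo_advantages(rewards)
staged = prefix_aware_advantages(rewards, prefix_ids, value_of_prefix)
```

Under flat standardization, a success from the shallow prefix and a success from the deep prefix receive the same advantage. With the per-prefix baseline, the success from the harder prefix is credited more, which is the kind of distinction a prefix-aware estimator is meant to capture.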

Paper Structure

This paper contains 49 sections, 11 theorems, 45 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Lemma 3.1

Let $\widehat{g} = \frac{1}{K} \sum_{k=1}^K a_k\,\nabla_\theta \log \pi_\theta(a_k \mid p_k)$, where $a_k = r_k - \alpha V(p_k)$ and $V(p)$ is any deterministic function of the prefix. Then
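Estimators of this form rest on a standard score-function fact: subtracting a baseline that depends only on the prefix leaves the expected gradient unchanged, because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid p)] = 0$. The sketch below checks this exactly, without sampling, for a single prefix. The three-action softmax policy, the reward vector, and the baseline value 0.6 are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def expected_gradient(logits, rewards, baseline):
    """Exact expectation of (r(a) - baseline) * grad log pi(a) over actions.

    For a softmax policy parameterized directly by its logits, the score
    function is grad log pi(a) = onehot(a) - pi.
    """
    pi = softmax(logits)
    grad = np.zeros_like(pi)
    for a in range(len(pi)):
        score = -pi.copy()
        score[a] += 1.0
        grad += pi[a] * (rewards[a] - baseline) * score
    return grad

logits = np.array([0.2, -1.0, 0.5])   # assumed policy parameters
rewards = np.array([1.0, 0.0, 1.0])   # assumed per-action rewards

g_no_baseline = expected_gradient(logits, rewards, baseline=0.0)
g_with_baseline = expected_gradient(logits, rewards, baseline=0.6)
```

The two expected gradients coincide, so the choice of baseline affects only the variance of the sampled estimator, not its mean.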

Figures (6)

  • Figure 1: Staged‐reasoning tree with reverse curriculum coloring: deeper prefixes are easier (green); shallow prefixes are harder (yellow). Arrows at leaves denote completing trajectories.
  • Figure 2: Tree-OPO vs. GRPO. GRPO (top) standardizes rewards from single-prompt completions, which inherently limits its ability to differentiate advantages across paths stemming from common prefixes with disparate expected returns. In contrast, our Tree-OPO (bottom) is designed to adhere to Equation (\ref{eq:sae-opt}), facilitating hierarchical advantage ordering aligned with the tree-induced curriculum. This replaces flat standardization, yielding more discriminative advantages for structured generation.
  • Figure 3: Performance comparison of GRPO and Tree-OPO variants across different datasets.
  • Figure 4: Training metrics over time. (a) Satisfaction rate of prefix-ordering constraints imposed by structured advantage estimation. (b) Variance of advantage estimates; lower variance improves update stability. (c) Task-specific average rewards during training. All curves are smoothed; color indicates strategy.
  • Figure 5: Illustration of Staged Reasoning Prompts and Text Mappings. (a) A group of example staged prompts. Each row is a distinct prompt chain, and the checkmark/cross indicates whether the model produced a correct final answer when conditioned on that prompt. (b) Correspondence between symbolic reasoning nodes (Q, A, B, etc.) and their textual content. Each node represents a distinct reasoning step, forming compositional instructions separated by double newlines (\n\n) in the actual prompt fed to the language model.
  • ...and 1 more figures

Theorems & Definitions (19)

  • Lemma 3.1: Unbiasedness williams1992simple
  • Lemma 3.2: Optimality of Expectation Baseline greensmith2004variance
  • Lemma 3.3: Tree-Induced Advantage Structure and Inductive Bias
  • Theorem 3.4: Tree Constraints Improve Gradient Signal
  • Theorem 3.5: SAE reduces estimation-to-class error
  • Lemma A.1: Unbiasedness of Gradient Estimate williams1992simple
  • proof
  • Lemma A.2: Expectation Minimizes Variance greensmith2004variance
  • proof
  • Lemma A.3: Tree-Induced Advantage Structure
  • ...and 9 more