Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

Hieu Tran, Zonghai Yao, Hong Yu

TL;DR

TEMPO introduces a critic-free reinforcement learning approach for LLM alignment by exploiting the prefix-tree structure formed by multiple responses to a prompt. By computing nonparametric prefix values from the descendant outcomes (P2T) and applying branch-gated temporal-difference corrections, TEMPO provides token-level credit only at branching points while preserving a GRPO-compatible training loop. Empirically, TEMPO outperforms PPO, GRPO, and HEPO on both in-distribution math/medical tasks and out-of-distribution benchmarks, achieving higher accuracy and faster convergence under the same hardware budget. This approach delivers fine-grained credit assignment without a value network and shows robust generalization across domains, suggesting practical benefits for reasoning-focused LLM training and potential extensions to verification and retrieval-augmented reasoning.

Abstract

Reinforcement learning improves LLM reasoning, yet sparse, delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA fit this setup, and in them only a few decision tokens significantly affect the outcome. PPO offers token-level advantages through a learned value model, but training the actor and critic simultaneously is complex, and it does not generalize easily, because token-level values from the critic can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but it spreads a single sequence-level return across all tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values $V(s)$ by aggregating descendant outcomes. Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with branch-gated temporal-difference (TD) corrections derived from the tree. At non-branch tokens the TD term is zero, so TEMPO reduces to GRPO; at branching tokens it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy in roughly the same wall-clock time.
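To make the P2T idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes responses are already tokenized into lists, that rewards are binary outcomes, and that "aggregating descendant outcomes" means a plain mean. The function name `prefix_values` and the toy tokens are hypothetical, and the prefix tree is represented implicitly as a dictionary keyed by token-prefix tuples.

```python
from collections import defaultdict


def prefix_values(responses, rewards):
    """Return {prefix tuple: mean reward of all responses that start with it}.

    Assumption: V(s) is the plain average of descendant outcomes.
    """
    totals = defaultdict(float)  # sum of rewards passing through each prefix
    counts = defaultdict(int)    # number of responses passing through each prefix
    for tokens, reward in zip(responses, rewards):
        # Every prefix of a response, from the empty prefix (the prompt node)
        # up to the full response (a leaf), is a node on its path in the tree.
        for t in range(len(tokens) + 1):
            prefix = tuple(tokens[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}


# Toy group of four sampled responses to one prompt (hypothetical tokens).
responses = [
    ["x", "=", "2"],  # correct
    ["x", "=", "3"],  # incorrect
    ["y", "=", "2"],  # incorrect
    ["y", "=", "5"],  # incorrect
]
rewards = [1.0, 0.0, 0.0, 0.0]

values = prefix_values(responses, rewards)
print(values[()])               # 0.25: root, mean over all four responses
print(values[("x",)])           # 0.5: one of two descendants is correct
print(values[("y",)])           # 0.0: no correct descendant
print(values[("x", "=", "2")])  # 1.0: leaf value equals its own reward
```

Materializing every prefix as a tuple keeps the sketch short but costs time quadratic in response length; an explicit trie with per-node counters would avoid that on long sequences.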

Paper Structure

This paper contains 34 sections, 8 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Comparison of credit assignment for RL training with verifiable rewards. GRPO: all tokens in each sampled answer share one sequence-level return; branching is ignored so credit spreads evenly. PPO: a learned value network estimates $V(s_t)$ and provides token-level advantages via GAE, but requires a critic and higher compute. TEMPO: convert the answer group for one prompt into a prefix tree and compute nonparametric prefix values $V(s)$ by averaging descendant outcomes; use branch-gated TD corrections to assign credit at branches.
  • Figure 2: Overview of prefix tree value estimation in TEMPO. Each node corresponds to a token prefix $s$, with $V(s)$ estimated by averaging over the outcomes of all descendant completions. Green leaves denote correct responses ($r=1$), red leaves denote incorrect ones ($r=0$). Intermediate nodes inherit averaged values (e.g., $V(s)=0.5$), providing informative signals at branching points.
  • Figure 3: Validation accuracy of MATH and MedQA for Qwen3-1.7B and Qwen3-4B. We compare TEMPO with PPO, GRPO, and HEPO. TEMPO consistently achieves higher accuracy and faster convergence across both domains and model sizes.
  • Figure 4: TEMPO converges faster and to higher accuracy than GRPO, surpassing GRPO's peak performance in fewer iterations and less overall time.
  • Figure 5: Effect of group size on MATH accuracy for Qwen3-1.7B. TEMPO consistently outperforms GRPO across all settings.
  • ...and 3 more figures
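
The credit assignment contrasted in the Figure 1 and Figure 2 captions can be illustrated with a small sketch. It rests on several assumptions that this page does not confirm: the TD term is taken as $V(s_{t+1}) - V(s_t)$ over tree-estimated values, the GRPO signal is the reward standardized within the group, and the two are simply added with unit weight. The function names and toy tokens are hypothetical. With mean-of-descendants values, a node with a single child has the same value as that child, so the TD term vanishes at non-branch tokens without an explicit gate.

```python
from collections import defaultdict
from statistics import mean, pstdev


def prefix_values(responses, rewards):
    """Mean reward of all responses sharing each token prefix (assumed V(s))."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        for t in range(len(tokens) + 1):
            prefix = tuple(tokens[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}


def td_corrections(responses, rewards):
    """Per-token TD term V(s_{t+1}) - V(s_t); zero wherever the tree does not branch."""
    value = prefix_values(responses, rewards)
    return [
        [value[tuple(tokens[: t + 1])] - value[tuple(tokens[:t])]
         for t in range(len(tokens))]
        for tokens in responses
    ]


def token_advantages(responses, rewards, eps=1e-6):
    """Assumed combination: group-standardized reward plus the TD term, unit weight."""
    mu, sigma = mean(rewards), pstdev(rewards)
    tds = td_corrections(responses, rewards)
    return [
        [(reward - mu) / (sigma + eps) + td for td in td_row]
        for reward, td_row in zip(rewards, tds)
    ]


# Toy group of four sampled responses to one prompt (hypothetical tokens).
responses = [["x", "=", "2"], ["x", "=", "3"], ["y", "=", "2"], ["y", "=", "5"]]
rewards = [1.0, 0.0, 0.0, 0.0]

tds = td_corrections(responses, rewards)
advantages = token_advantages(responses, rewards)
print(tds[0])  # [0.25, 0.0, 0.5]: credit at the two branch points, none at "="
print(tds[1])  # [0.25, 0.0, -0.5]: same first-token credit, penalty at the wrong final token
```

In this toy group the shared token "=" receives only the sequence-level signal, which is the GRPO behavior, while the first and last tokens of the "x" responses receive extra positive or negative credit because the tree branches there.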