Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
Hieu Tran, Zonghai Yao, Hong Yu
TL;DR
TEMPO is a critic-free reinforcement learning approach for LLM alignment that exploits the prefix-tree structure formed by multiple responses to the same prompt. It computes nonparametric prefix values from descendant outcomes (P2T) and applies branch-gated temporal-difference corrections, so token-level credit is assigned only at branching points while the training loop stays GRPO-compatible. Empirically, TEMPO outperforms PPO, GRPO, and HEPO on both in-distribution math/medical tasks and out-of-distribution benchmarks, achieving higher accuracy and faster convergence under the same hardware budget. The approach delivers fine-grained credit assignment without a value network and generalizes robustly across domains, suggesting practical benefits for reasoning-focused LLM training and potential extensions to verification and retrieval-augmented reasoning.
Abstract
Reinforcement learning improves LLM reasoning, yet sparse, delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA fit this setting, and in them only a few decision tokens significantly affect the outcome. PPO offers token-level advantages through a learned value model, but training the actor and critic simultaneously is complex, and it generalizes poorly because the critic's token-level values can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but it spreads a single sequence-level return across all tokens and ignores branching. We introduce \textbf{Prefix-to-Tree (P2T)}, a simple procedure that converts a group of responses into a prefix tree and computes \emph{nonparametric} prefix values \(V(s)\) by aggregating descendant outcomes. Built on P2T, we propose \textbf{TEMPO} (\emph{\textbf{T}ree-\textbf{E}stimated \textbf{M}ean Prefix Value for \textbf{P}olicy \textbf{O}ptimization}), a critic-free algorithm that augments the group-relative outcome signal of GRPO with \emph{branch-gated} temporal-difference (TD) corrections derived from the tree. At non-branch tokens, the TD term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy in roughly the same wall-clock time.
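To make the P2T idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that \(V(s)\) is the plain mean of the rewards of all sampled responses sharing prefix \(s\), that the TD term is \(V(s_{t+1}) - V(s_t)\), and that it is simply added to a standardized GRPO-style sequence advantage with a weight `td_weight`. The function names, the weight, and the exact normalization are illustrative assumptions; the abstract does not specify them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def prefix_values(responses, rewards):
    """Assumed P2T step: V(prefix) = mean reward of responses starting with it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, r in zip(responses, rewards):
        for t in range(len(tokens) + 1):  # t = 0 is the empty prefix (tree root)
            prefix = tuple(tokens[:t])
            totals[prefix] += r
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}


def td_corrections(responses, values):
    """Per-token TD term V(s_{t+1}) - V(s_t), read off the prefix tree."""
    return [
        [values[tuple(tok[: t + 1])] - values[tuple(tok[:t])] for t in range(len(tok))]
        for tok in responses
    ]


def tempo_like_advantages(responses, rewards, td_weight=1.0, eps=1e-6):
    """Hypothetical combination: GRPO-style sequence advantage plus weighted TD term."""
    values = prefix_values(responses, rewards)
    td = td_corrections(responses, values)
    mu, sigma = mean(rewards), pstdev(rewards)
    return [
        [(r - mu) / (sigma + eps) + td_weight * d for d in td_row]
        for r, td_row in zip(rewards, td)
    ]


# Toy group of four sampled responses (tokens shown as letters), binary rewards.
responses = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"], ["g", "h", "i"]]
rewards = [1.0, 0.0, 0.0, 1.0]

values = prefix_values(responses, rewards)
td = td_corrections(responses, values)
adv = tempo_like_advantages(responses, rewards)
```

In this toy group, the fourth response shares no prefix with the others, so its TD term is nonzero only at its first token (the branch at the root) and zero afterwards, matching the branch-gated behavior described above. Under these assumptions the gating needs no explicit mask: a prefix with a single child has the same set of descendant outcomes as that child, so the two values coincide and their difference vanishes.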
