Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning

Qiao Liang, Yuke Zhu, Chao Ge, Lei Yang, Ying Shen, Bo Zheng, Sheng Guo

TL;DR

This work tackles sparse, delayed rewards in long-horizon tool-integrated reasoning by introducing Error-Localized Policy Optimization (ELPO). ELPO localizes the first irrecoverable step through a binary-search rollout tree under a fixed budget $N_{\text{total}}$, then derives stable learning signals via hierarchical advantages that combine local branch comparisons with global trajectory rankings, augmented by error-localized adaptive clipping. Empirical results across mathematical reasoning, science QA, and code execution show ELPO consistently outperforms strong Agentic RL baselines, with improved Pass@K and Major@K scaling and more efficient tool usage. The approach provides a practical path toward finer-grained credit assignment in complex LLM reasoning tasks that rely on external tools.

Abstract

Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
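
To make the localization idea concrete, the following is a minimal sketch of a binary search for the first irrecoverable step under a fixed rollout budget. It is not the authors' implementation: the function name `localize_first_irrecoverable`, the callback `rollout_succeeds`, and the parameters `rollouts_per_probe` and `total_budget` are hypothetical, and the sketch assumes that recoverability is monotone along the trajectory (once a prefix is irrecoverable, every longer prefix is too) and that the empty prefix is recoverable.

```python
from typing import Callable, Sequence, Tuple


def localize_first_irrecoverable(
    steps: Sequence[str],
    rollout_succeeds: Callable[[Sequence[str]], bool],
    rollouts_per_probe: int,
    total_budget: int,
) -> Tuple[int, int, int]:
    """Binary-search the first irrecoverable step of a failed trajectory.

    Returns (lo, hi, used). The prefix of length lo was observed to be
    recoverable, the prefix of length hi was not, and `used` rollouts were
    spent. When hi - lo == 1, step number hi (1-indexed) is the estimate of
    the first irrecoverable step. (Hypothetical interface, for illustration.)
    """
    lo, hi = 0, len(steps)  # empty prefix assumed recoverable; full trajectory failed
    used = 0
    # Probe only while a full probe still fits into the remaining budget.
    while hi - lo > 1 and used + rollouts_per_probe <= total_budget:
        mid = (lo + hi) // 2
        prefix = steps[:mid]
        recovered = False
        for _ in range(rollouts_per_probe):
            used += 1
            if rollout_succeeds(prefix):  # one resampled completion from this prefix
                recovered = True
                break
        if recovered:
            lo = mid  # the error lies after step mid
        else:
            hi = mid  # the error lies at or before step mid
    return lo, hi, used


# Toy oracle: any prefix that already contains step 5 can no longer succeed.
toy_steps = [f"step-{i}" for i in range(1, 9)]
toy_oracle = lambda prefix: len(prefix) < 5
result = localize_first_irrecoverable(toy_steps, toy_oracle, rollouts_per_probe=2, total_budget=16)
print(result)
```

With a stochastic policy, a probe that finds no success only suggests that the prefix is irrecoverable, so a larger `rollouts_per_probe` trades budget for a more reliable decision at each split. If the budget runs out early, the function returns a bracket `(lo, hi]` rather than a single step.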

Paper Structure (29 sections, 12 equations, 11 figures, 4 tables, 1 algorithm)

Figures (11)

  • Figure 1: For each initially failed trajectory (Pass@16$=0$), we edit exactly one step (random error, first error, or the 1st/2nd/3rd irrecoverable step) and resample completions. Recovery is counted when the repaired run reaches Pass@16$=1$.
  • Figure 2: The overview of Error-Localized Policy Optimization (ELPO).
  • Figure 3: Local ranking quality at branching prefixes (pairwise accuracy / Kendall’s $\tau$ vs Mean@32 reference).
  • Figure 4: An example of a rollout tree based on BEL. The red box outlines the first irrecoverable step identified by the rollout tree.
  • Figure 5: Pass@K and Major@K sampling analysis on AIME2024/2025.
  • ...and 6 more figures
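
The captions above refer to Pass@K and Major@K without defining them in this excerpt. As a hedged illustration, the sketch below uses the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct, and a plain majority vote over the first $K$ sampled answers. The function names `pass_at_k` and `major_at_k` and the tie-breaking rule are assumptions, and the paper's exact definitions may differ.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n samples (c of them correct) is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever k > n - c, so the result is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def major_at_k(answers: Sequence[Hashable], reference: Hashable, k: int) -> bool:
    """True if the most frequent answer among the first k samples equals the
    reference. Ties go to the answer that appeared first (an assumption)."""
    if not 1 <= k <= len(answers):
        raise ValueError("require 1 <= k <= len(answers)")
    winner, _ = Counter(answers[:k]).most_common(1)[0]
    return winner == reference


print(pass_at_k(n=4, c=1, k=2))
print(major_at_k(["42", "41", "42"], reference="42", k=3))
```

Under this reading, the Pass@16$=0$ condition in Figure 1 corresponds to a problem for which none of 16 sampled completions is correct.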