Hindsight Credit Assignment for Long-Horizon LLM Agents

Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, Yu-Feng Li

TL;DR

HCAPO is the first framework to integrate hindsight credit assignment into LLM agents. It significantly enhances exploration efficiency, promotes concise decision-making, and scales to complex, long-horizon tasks.

Abstract

Large Language Model (LLM) agents often face significant credit assignment challenges in long-horizon, multi-step tasks due to sparse rewards. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO leverages the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning. Furthermore, HCAPO's multi-scale advantage mechanism compensates for the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, demonstrate that HCAPO consistently outperforms state-of-the-art RL methods. Notably, using the Qwen2.5-7B-Instruct model, HCAPO improves success rate over GRPO by 7.7% on WebShop and by 13.8% on ALFWorld. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and scales to complex, long-horizon tasks.

Paper Structure

This paper contains 46 sections, 12 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: From trajectory-level to step-level: hindsight credit assignment for long-horizon agents. $\rho$ is the hindsight ratio.
  • Figure 2: The HCAPO framework. (a) Illustrates the generative verification process: for a candidate action $a_t$, the LLM acts as a critic to compute the hindsight score $\rho_t$ by conditioning on the state $s_t$ and hindsight information $s_{final}$. (b) Shows the full optimization loop where a group of $G$ trajectories is evaluated via Hindsight Q-values to produce the final group-based advantage $A_{i,t}$.
  • Figure 3: Left: Proportion of redundant actions during training on the WebShop task. Right: Path-shortening effect of HCAPO vs. GRPO on the WebShop task.
  • Figure 4: Computational cost breakdown during training. The hindsight audit pass accounts for only 8.3% of total training time.
  • Figure 5: Success rate during training.
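
The captions for Figures 1 and 2 describe moving from a trajectory-level advantage to a step-level one using the hindsight ratio $\rho$. The sketch below illustrates that idea only in outline. The first function is the standard GRPO-style group normalization of trajectory returns. The second function is a hypothetical placeholder that scales each trajectory's advantage by a per-step hindsight ratio; this excerpt does not give HCAPO's actual hindsight Q-value or multi-scale advantage formulas, so the weighting rule, the function names, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style trajectory-level advantage.

    Each return is normalized against the mean and (population) standard
    deviation of its own group of G trajectories.
    """
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def step_level_advantages(returns, hindsight_ratios, eps=1e-8):
    """Hypothetical step-level refinement (an assumption, not HCAPO's formula).

    hindsight_ratios[i][t] stands in for the hindsight score rho_t of step t
    in trajectory i. Here it simply rescales the trajectory-level advantage.
    """
    traj_adv = group_relative_advantages(returns, eps)
    return [
        [adv * rho for rho in ratios]
        for adv, ratios in zip(traj_adv, hindsight_ratios)
    ]


if __name__ == "__main__":
    # Sparse terminal rewards for a group of G = 4 trajectories.
    returns = [1.0, 0.0, 0.0, 1.0]
    # Made-up per-step ratios; trajectories may have different lengths.
    ratios = [[1.0, 0.5], [1.0], [0.2, 0.2, 0.2], [0.0]]
    for i, steps in enumerate(step_level_advantages(returns, ratios)):
        print(i, [round(a, 3) for a in steps])
```

Under plain GRPO every step of a trajectory shares one advantage value. Any step-level scheme, including the placeholder above, differs by letting steps within the same trajectory receive different credit.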