Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

Yuning Wu, Ke Wang, Devin Chen, Kai Wei

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves asymptotic consistency: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures off-policy guidance acts as a temporary scaffold rather than a persistent ceiling, enabling the model to surpass the limitations of static teacher forcing.
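The advantage collapse mentioned in the abstract can be illustrated with a small sketch. It assumes the standard group-relative normalization used by GRPO (reward minus group mean, divided by group standard deviation) and binary verifiable rewards. The function `inject_success` is a hypothetical stand-in for the idea behind SSI, not the paper's operator: it simply swaps one failed rollout for a teacher demonstration with reward 1 when the whole group fails. The Thompson sampling-inspired gate is not modeled here, since the abstract does not specify it.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def inject_success(rewards, teacher_reward=1.0):
    """Hypothetical stand-in for success injection (an assumption, not the
    paper's SSI operator): if every rollout in the group failed, replace the
    last one with a teacher demonstration that earns `teacher_reward`."""
    if max(rewards) > 0:
        return list(rewards)  # at least one success: leave the group on-policy
    return list(rewards[:-1]) + [teacher_reward]


# A group of four rollouts that all fail under a sparse binary reward.
failed = [0.0, 0.0, 0.0, 0.0]

# Every advantage is zero, so the policy gradient carries no signal.
collapsed = group_advantages(failed)

# With one injected success, the group has non-zero variance again.
anchored = group_advantages(inject_success(failed))
```

In this toy setting the all-failure group yields identical zero advantages, while the anchored group gives the injected sample a positive advantage and the failed rollouts negative ones. A group that already contains a success is left unchanged, which matches the abstract's description of teacher guidance being applied only during failure.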

Paper Structure (18 sections, 2 theorems, 14 equations, 2 figures, 1 table, 1 algorithm)

Key Result

Theorem 4.1

Assume the policy $\pi_\theta$ is differentiable, the reward function is bounded, and the gradients of both the shaping operator $\mathcal{F}$ and the CLIP loss satisfy $\|\nabla \mathcal{L}\| \le G$. With a decaying learning rate $\eta_t = \mathcal{O}(1/\sqrt{t})$, the HAPO algorithm converges to a

Figures (2)

  • Figure 1: Hindsight-Anchored Policy Optimization (HAPO) system architecture
  • Figure 2: Training dynamics of HAPO compared with LUFFY. From left to right: average reward, generation length, and number of teacher samples during training. For fair comparison, both reward and generation length are computed by excluding trajectories guided by teacher demonstration.

Theorems & Definitions (4)

  • Theorem 4.1: Convergence
  • Proof (sketch)
  • Theorem 4.2: Asymptotic Purity
  • Proof