Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs

Wenjian Zhang, Kongcheng Zhang, Jiaxin Qi, Baisheng Lai, Jianqiang Huang

Abstract

Reinforcement Learning (RL) with rubric-based rewards has recently shown remarkable progress in enhancing the general reasoning capabilities of Large Language Models (LLMs), yet it still suffers from ineffective exploration confined to the current policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, and effective exploration should align its efforts with that desired target. Leveraging this insight, we propose HeRL, a Hindsight experience guided Reinforcement Learning framework that bootstraps effective exploration by explicitly telling LLMs the desired behaviors specified in the rewards. Concretely, HeRL treats failed trajectories along with their unmet rubrics as hindsight experience, which serves as in-context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high-quality samples without repeated trial-and-error from scratch, which theoretically yields a more accurate estimate of the expected gradient. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines and can further benefit from experience-guided self-improvement at test time. Our code is available at https://github.com/sikelifei/HeRL.

Paper Structure (25 sections, 1 theorem, 16 equations, 5 figures, 4 tables)

This paper contains 25 sections, 1 theorem, 16 equations, 5 figures, 4 tables.

Key Result

Proposition 3.1

Let $w$ denote the point weight set of all rubrics, $w_+$ denote the point weight set of satisfied rubrics, and $w_-$ denote the point weight set of unsatisfied rubrics. The expected (ideal) reward is $R_I(\tau; q) = w^\top \cdot \mathbf{1}$, and the estimated reward at step $T$ is $R_T(\tau; q) = w

Figures (5)

  • Figure 1: (Left) A conceptual illustration of the efficiency of exploration and the effectiveness of experience guided sampling. (Right) Model performance comparison between baselines and our proposed HeRL across different reasoning domains.
  • Figure 2: Performance comparison of sampling strategies. The guided sampling by hindsight experience consistently outperforms stochastic sampling and entropy-based sampling.
  • Figure 3: The overall framework of HeRL. First, we sample candidate trajectories and evaluate them using checklist-style rubrics. Then we revise the failed trajectories with the highest reward and preserve the best improvements. Both the original attempts and the subsequent improvements are optimized using reinforcement learning, supplemented by a bonus reward to incentivize responses with higher improvement potential.
  • Figure 4: Investigation of HeRL’s sampling efficiency. (a) Pass@k performance of Qwen2.5-7B-Instruct, RLVR, and HeRL on IFBench. (b) Iterative revision with experience guidance further outperforms Pass@k for both RLVR and HeRL on HealthBench-500 with Qwen3-4B-Instruct-2507.
  • Figure 5: Training dynamics of different models. (Top) Entropy, (Mid) Reward, and (Bottom) Validation Reward curves over training steps on the RAR-Medicine dataset are reported for three base models, comparing HeRL with the RLVR baseline. Model names are abbreviated in the plots: Qwen2.5 denotes Qwen2.5-7B-Instruct, Qwen3 denotes Qwen3-4B-Instruct-2507, and Llama denotes Llama3.2-3B-Instruct.
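
The sampling and revision steps described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Rubric`, `Trajectory`, `rubric_reward`, `select_for_revision`, and `build_hindsight_prompt` are hypothetical, the reward is assumed to be the plain sum of the weights of satisfied rubrics (with positive weights), and the prompt wording is invented for the example. The actual code is in the repository linked in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rubric:
    text: str      # checklist item the response should satisfy
    weight: float  # point weight of the item (assumed positive here)


@dataclass
class Trajectory:
    response: str
    satisfied: List[bool]  # one flag per rubric, in rubric order


def rubric_reward(rubrics: List[Rubric], satisfied: List[bool]) -> float:
    """Sum of the weights of satisfied rubrics (assumed reward form)."""
    return sum(r.weight for r, ok in zip(rubrics, satisfied) if ok)


def select_for_revision(
    rubrics: List[Rubric], trajectories: List[Trajectory]
) -> Optional[Trajectory]:
    """Pick the failed trajectory with the highest reward, if any.

    A trajectory counts as failed when at least one rubric is unmet.
    """
    failed = [t for t in trajectories if not all(t.satisfied)]
    if not failed:
        return None
    return max(failed, key=lambda t: rubric_reward(rubrics, t.satisfied))


def build_hindsight_prompt(
    question: str, rubrics: List[Rubric], traj: Trajectory
) -> str:
    """Combine the failed attempt and its unmet rubrics as in-context guidance."""
    unmet = [r.text for r, ok in zip(rubrics, traj.satisfied) if not ok]
    lines = [
        f"Question: {question}",
        f"Previous attempt: {traj.response}",
        "The previous attempt did not satisfy these requirements:",
    ]
    lines += [f"- {item}" for item in unmet]
    lines.append("Revise the attempt so that every requirement is satisfied.")
    return "\n".join(lines)
```

In a training loop along these lines, the revised responses would be scored with the same rubrics, and only revisions that improve on the original attempt would be kept for optimization; the bonus reward mentioned in the caption is not modeled here.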

Theorems & Definitions (3)

  • Proposition 3.1
  • Remark 3.2
  • Proof