Learning to Hint for Reinforcement Learning

Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He

Abstract

Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates a learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online, conditioned on the current reasoner's incorrect rollout, so that hint generation adapts to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. As a result, HiLL favors hints that not only recover informative GRPO groups but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive, transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.
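To make advantage collapse concrete, below is a minimal sketch of GRPO's group-relative advantage, which standardizes verifiable 0/1 rewards within a rollout group (exact normalization details vary across implementations):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Standardize verifiable 0/1 rewards within one GRPO rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_advantages([0, 0, 0, 0]))  # all-incorrect group: every advantage is 0
print(group_advantages([1, 0, 0, 1]))  # mixed group: non-zero learning signal
```

An all-correct group collapses in the same way, since any group with identical rewards has zero within-group variance and hence no update.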

Paper Structure

This paper contains 15 sections, 1 theorem, 17 equations, 4 figures, 2 tables, and 1 algorithm.

Key Result

Proposition 1

For a question-hint pair $(q,h)$, let $P_h(\tau)=\pi_\theta(\tau\mid q{+}h)$ and $P(\tau)=\pi_\theta(\tau\mid q)$ denote the rollout distributions under the hinted and original inputs, with success probabilities $p_h=P_h\bigl(r(\tau){=}1\bigr)$ and $p=P\bigl(r(\tau){=}1\bigr)$. If $p_h>0$, then
$$p \;\ge\; p_h\,\exp\bigl(-\rho(q,h)\bigr), \qquad \rho(q,h)=\mathbb{E}_{\tau\sim P_h(\cdot\,\mid\,r(\tau)=1)}\!\left[\log\frac{P_h(\tau)}{P(\tau)}\right],$$
where $\rho(q,h)$ is the hint reliance, the expected log-likelihood ratio of correct hinted rollouts under the hinted versus the original input. Therefore, lower hint reliance guarantees a larger no-hint success probability $p$, i.e., stronger transfer from hinted success to no-hint success.
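The bound follows from a short change-of-measure argument; the sketch below assumes, as in the statement above, that hint reliance $\rho(q,h)$ is the expected log-likelihood ratio over correct hinted rollouts:

```latex
% Change of measure on the success set, then Jensen's inequality
% E[X] >= exp(E[log X]); success mass where P_h(tau)=0 only increases p.
\begin{align*}
p \;=\!\!\sum_{\tau\,:\,r(\tau)=1}\!\! P(\tau)
  \;\ge\; p_h\,\mathbb{E}_{\tau\sim P_h(\cdot\mid r(\tau)=1)}\!\left[\frac{P(\tau)}{P_h(\tau)}\right]
  \;\ge\; p_h \exp\!\left(-\,\mathbb{E}_{\tau\sim P_h(\cdot\mid r(\tau)=1)}\!\left[\log\frac{P_h(\tau)}{P(\tau)}\right]\right)
  \;=\; p_h\, e^{-\rho(q,h)}.
\end{align*}
```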

Figures (4)

  • Figure 1: Overview of our HiLL framework. Given a question $q$ with an all-incorrect group, the hinter $\mathcal{H}_\phi$ takes the question, a failed rollout $\tau_k$, and the reference solution $z^\star$ as input, and generates $M$ candidate hints. The reasoner $\pi_\theta$ re-samples $G$ rollouts under each hinted input $q{+}h_j$. Each hint is then scored by signal creation (Sec. \ref{subsec:signal_creation}) and signal transfer (Sec. \ref{subsec:transfer}). The best hinted group is selected for the reasoner GRPO update, while all candidate hints form the group for the hinter GRPO update (see the sketch after this list).
  • Figure 2: All-incorrect ratio (left two) and hint reliance (right two) over training steps. Both HiLL variants substantially reduce the fraction of degenerate all-incorrect groups compared to GRPO. Transfer weighting (HiLL vs. HiLL$_\text{w/o TW}$) keeps hint reliance consistently lower throughout training.
  • Figure 3: Effect of transfer temperature $T$ on average signal creation, signal transfer, and in-distribution accuracy for HiLL with Llama-3.2-3B-Instruct. Dashed lines show HiLL$_\text{w/o TW}$.
  • Figure 4: Hint length and math expressions per hint.
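To make the per-question loop of Figure 1 concrete, here is a minimal sketch of one HiLL step. All interfaces (`hinter`, `reasoner`, `verify`, `logp`) and the particular scoring formulas are illustrative assumptions, not the released implementation; see the repository for the actual code.

```python
import math

def hill_step(q, failed_rollout, z_star, hinter, reasoner, verify, logp,
              M=4, G=8, T=1.0):
    """One HiLL step on a hard question whose GRPO group was all-incorrect.

    Hypothetical interfaces (illustrative, not the released HiLL API):
      hinter(q, rollout, solution) -> hint string
      reasoner(prompt)             -> rollout string
      verify(rollout)              -> bool, the verifiable reward
      logp(prompt, rollout)        -> log pi_theta(rollout | prompt)
    """
    scored = []
    for _ in range(M):
        h = hinter(q, failed_rollout, z_star)
        rollouts = [reasoner(q + h) for _ in range(G)]
        rewards = [1.0 if verify(t) else 0.0 for t in rollouts]
        p_h = sum(rewards) / G

        # Signal creation: a group yields a GRPO signal only when outcomes mix.
        creation = 1.0 if 0.0 < p_h < 1.0 else 0.0

        # Hint reliance: average log-likelihood ratio of the correct rollouts
        # under the hinted vs. original input (cf. Proposition 1).
        correct = [t for t, r in zip(rollouts, rewards) if r == 1.0]
        if correct:
            reliance = sum(logp(q + h, t) - logp(q, t) for t in correct) / len(correct)
            transfer = math.exp(-max(reliance, 0.0) / T)
        else:
            transfer = 0.0

        scored.append((creation * transfer, h, rollouts, rewards))

    # The best hinted group drives the reasoner's GRPO update; the M scores
    # themselves serve as group rewards for the hinter's GRPO update.
    best = max(scored, key=lambda s: s[0])
    return best, [s[0] for s in scored]
```

The creation term mirrors GRPO's requirement of mixed outcomes, while the transfer term down-weights hints whose correct rollouts would be unlikely without the hint, in line with Proposition 1.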

Theorems & Definitions (2)

  • Proposition 1: Transfer bound
  • Proof of Proposition 1