Nudging the Boundaries of LLM Reasoning

Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, Chien-Sheng Wu

TL;DR

NuRL addresses the limitation of online RL in LLM reasoning by enabling learning from hard, previously unsolvable problems through self-generated, abstract hints. It combines offline hint collection with adaptive online rollout augmentation, injecting hints only for difficult cases after GRPO convergence to expand the model's reasoning ceiling. Empirically, NuRL yields consistent gains across six benchmarks and three models, and can further boost performance when hints come from a stronger external model, while remaining complementary to test-time scaling. The key insight is that high-level hints broaden the model's comfort zone, turn unsolvable samples solvable, and improve upper-bound performance with selective, informative guidance.

Abstract

Current online reinforcement learning (RL) algorithms like GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are "unsolvable" to the model. In other words, they can only improve performance on problems where the model is capable of exploring the correct answer. Consequently, the model's "upper limit" remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. These hard samples cannot contribute to training, as no rollouts yield rewards and thus no gradients are produced. To unlock learning from these hard samples, we propose NuRL, a "nudging" method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that help reduce the problem difficulty for the model. Given a question and its gold answer, the model generates a CoT and then produces a hint containing the core knowledge needed to solve the problem. During training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint boosts pass rates (from 0% to non-zero), thereby introducing training signals for previously unsolvable samples, and (2) the hints are self-generated, which avoids distributional shift and removes any reliance on external models. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model's upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. Furthermore, we present a systematic study of what makes an effective hint and when hints are most useful. Interestingly, the best hints are abstract and high-level, and are most beneficial when applied only when necessary and after GRPO has converged.
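The claim that unsolvable samples produce no gradients can be illustrated with a small sketch. It assumes the commonly used group-normalized advantage, (r_i - mean) / (std + eps), computed over the rewards of one rollout group; the function name `group_advantages`, the `eps` value, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps).

    Assumption: this mirrors the usual GRPO-style normalization; the exact
    form used by the authors is not given in this excerpt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A hard sample: all 8 rollouts fail, so every advantage is exactly zero
# and the sample contributes nothing to the policy gradient.
unsolved = group_advantages([0.0] * 8)

# The same sample after a hint lets a single rollout succeed: the successful
# rollout gets a positive advantage and the failures get negative ones.
nudged = group_advantages([0.0] * 7 + [1.0])
```

With uniformly zero rewards the numerator is zero for every rollout, which is why raising the pass rate from 0% to any non-zero value is enough to restore a training signal.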

Paper Structure

This paper contains 17 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: GRPO yields substantial gains, but the improvements largely stem from extending the model's ability within its comfort zone, i.e., if the model fails to solve a hard problem after numerous attempts, it is unable to learn from that problem. In NuRL, we address this by exploring various forms of hints (abstract cues, partial steps, explanations, or even the gold answer), which can be self-generated or teacher-generated. Both self- and teacher-generated abstract cues can expand the model's comfort zone, effectively transforming previously unsolvable problems into solvable ones.
  • Figure 2: NuRL provides targeted guidance to the LLM policy during online GRPO training. Prior to training, we construct an offline collection of hints, defined as abstract problem-specific cues that reduce task difficulty. During the online training, whenever all $\mathcal{G}$ rollouts for a problem are incorrect, NuRL augments $\mathcal{G}-1$ of the rollouts with the corresponding hint and regenerates the batch. This intervention facilitates the acquisition of non-zero rewards on instances that would otherwise yield uniformly zero rewards, thereby supplying informative training signals.
  • Figure 3: Compared to GRPO's improvements with Self-Consistency ($+7.6\%$ and $+7.8\%$ on Llama and OctoThinker), NuRL obtains larger gains with $+8.0\%$ and $+9.4\%$, respectively.
  • Figure 4: Comparison of different types of hints. From left to right, the hints vary in how directly they disclose information about the ground-truth answer. At the leftmost end, abstract hints provide only high-level guidance without revealing details of the solution or answer, whereas at the rightmost end, the answer is given explicitly. Interestingly, more direct hints lead to worse performance.
  • Figure 5: When the base model (Llama) already has strong pre-trained knowledge (e.g., MATH 500), both GRPO and NuRL yield little improvement in pass@k. In contrast, on tasks with lower upper-bound performance (e.g., Date Understanding and GPQA, with pass@1024 of 85.4 and 67.2), GRPO provides no gains on pass@1024, while NuRL pushes it further.
  • ...and 1 more figure
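The rollout augmentation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `nudged_rollouts`, the `sample` and `is_correct` callables, the binary reward, and the "Hint:" prompt format are all hypothetical. Only the control flow follows the caption: if every rollout in the group is incorrect, regenerate the batch with the hint added to G-1 of the rollouts.

```python
from typing import Callable, List, Optional, Tuple

Rollout = Tuple[str, str, float]  # (prompt, response, reward)


def nudged_rollouts(
    question: str,
    hint: Optional[str],
    sample: Callable[[str], str],       # hypothetical: prompt -> response
    is_correct: Callable[[str], bool],  # hypothetical: response -> verdict
    group_size: int = 8,
) -> List[Rollout]:
    """Sample a rollout group; regenerate it with hints only if all fail."""

    def run(prompts: List[str]) -> List[Rollout]:
        group = []
        for prompt in prompts:
            response = sample(prompt)
            reward = 1.0 if is_correct(response) else 0.0
            group.append((prompt, response, reward))
        return group

    group = run([question] * group_size)
    # Keep the original group if any rollout succeeded or no hint is stored.
    if hint is None or any(reward > 0 for _, _, reward in group):
        return group

    # All rollouts failed: regenerate, hinting G-1 prompts and leaving one plain.
    hinted = f"{question}\n\nHint: {hint}"  # assumed prompt format
    return run([hinted] * (group_size - 1) + [question])


# Toy stand-ins for the policy and the answer checker.
def toy_sample(prompt: str) -> str:
    return "42" if "Hint:" in prompt else "0"


def toy_check(response: str) -> bool:
    return response == "42"


group = nudged_rollouts(
    "What is 6 * 7?", "Multiply the two numbers.", toy_sample, toy_check, group_size=4
)
rewards = [reward for _, _, reward in group]
```

In the toy run the unhinted policy always fails, so the group is regenerated and the three hinted rollouts receive non-zero rewards while the single unhinted one still fails.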