What Can You Do When You Have Zero Rewards During RL?

Jatin Prakash, Anirudh Buvanesh

TL;DR

The paper investigates how to overcome the zero-reward barrier in reinforcement learning for reasoning tasks, by evaluating several baselines designed for sparse rewards on a controlled graph-search task. It finds that these methods fail when the base model cannot sample any correct solution, highlighting a cold-start problem. The authors demonstrate a simple data-centric remedy—adding easier samples to the training set—to bootstrap learning without altering the RL algorithm, and provide open-source implementations for replication. This work underscores the practical value of implicit curricula via data composition for enabling RL in zero-reward scenarios and offers a scalable recipe for practitioners.

Abstract

Reinforcement learning (RL) with outcome-based rewards has proven effective for improving large language models (LLMs) on complex reasoning tasks. However, its success often depends on the base model occasionally sampling correct solutions. When no correct solutions are sampled, training encounters a zero-reward barrier where learning stalls due to zero gradients. We study this scenario through the graph search task introduced in Bachmann et al. (2024) and evaluate recent methods that incorporate desirable components such as dense rewards, diversity incentives, and improved credit assignment. Our experiments show that none of these approaches overcome the zero-reward barrier if the base model never produces a correct answer. In contrast, we find that a simple data-centric intervention of adding easier samples to the training set enables the model to eventually solve the original hard task despite starting from zero reward. Importantly, this succeeds without modifying the RL algorithm itself. Because official implementations of several baselines were unavailable, we developed our own, which allowed us to conduct a detailed analysis of their failure modes. We release these implementations to support further research at: https://github.com/rl4reasoning/rl-baselines
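
To make the zero-gradient claim concrete, here is a minimal sketch of a mean-centered group advantage of the kind used by GRPO-style methods. It assumes a binary outcome reward per sampled response; the function names are hypothetical and this is not the authors' released implementation. When every sample in a group receives zero reward, every advantage is zero, so the policy-gradient update carries no signal.

```python
from statistics import mean


def group_advantages(rewards):
    """Mean-centered advantages for one prompt: A_i = r_i - mean(r).

    Assumption: a simple group baseline with no variance normalization.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def has_learning_signal(rewards):
    """True if at least one sampled response gets a non-zero advantage."""
    return any(a != 0 for a in group_advantages(rewards))


# All samples wrong: every advantage is zero, so the update is zero.
all_wrong = [0.0, 0.0, 0.0, 0.0]
# One correct sample is enough to produce non-zero advantages.
one_right = [0.0, 0.0, 1.0, 0.0]
```

In this sketch, `has_learning_signal(all_wrong)` is false and `has_learning_signal(one_right)` is true, which is the cold-start condition the abstract describes.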

Paper Structure

This paper contains 25 sections, 9 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: (Left) Illustration of the Degree-3-Path-3 task, where the center node ($4$) has degree $3$ and each outgoing path has length $3$. The graph is represented as an adjacency list, and given a source node ($4$) and a destination node ($7$), the task is to output a path from source to destination (e.g., $4,2,7$). See Appendix \ref{sec:appendix-prompts} for the prompt used. (Right) Success rates of different baselines: Dr.GRPO, VinePPO, Progress Rewards, and Best-of-N aware finetuning, compared with our data-mixing approach, which augments the training dataset with an equal proportion of samples from the easier Deg-5-Path-5 dataset. The baselines fail to break the zero-reward barrier, yielding zero success on the test set, whereas mixing in easier samples is effective with outcome rewards alone.
  • Figure 2: Success rates of different RL algorithms (Dr.GRPO, VinePPO, Progress Rewards, and Best-of-N aware finetuning) on a held-out test set of Degree-3-Path-3 graphs. These models were trained on Degree-3-Path-3 graphs. All algorithms are able to solve the task when the model starts with a reasonable success rate. Furthermore, VinePPO converges in fewer iterations compared to Dr.GRPO, consistent with findings reported in the literature.
  • Figure 3: Effect of Progress Rewards using different prover policies. (Left): Fraction of non-zero step advantages ($\hat{A}^{\mu}_{y_{c_i}} \neq 0$ in Equation \ref{eqn:rewarding-progress-adv}) for two provers: $\mu = \texttt{Best-of-4}(\pi_{5\text{x}5})$ and $\mu = \texttt{Best-of-4}(\pi_{5\text{x}5\text{-mixed-with-}10\text{x}10})$, where the models were trained on Deg-5-Path-5 alone or mixed with Deg-10-Path-10, respectively. Both models provide non-zero step advantages for Progress Rewards due to their reasonable success rates on the harder task. (Right): Success rate on a held-out test set of Degree-10-Path-10 examples. Despite using the same two provers, both models fail on the Degree-10-Path-10 task. We believe this is because the prover policy is not well aligned with the policy being optimized.
  • Figure 4: Using a lower KL coefficient, i.e., the standard value of $0.001$, in Best-of-N aware finetuning results in unstable training due to large-magnitude negative gradients, causing model responses to degenerate into repeating the same character. In contrast, using a KL schedule as recommended in Chow et al. (2024) (decaying from a strong KL penalty of $0.1$ to $0.001$) remains stable but fails to learn, as success rates stay at zero (see the right panel).
  • Figure 5: (Left): Training rewards obtained by the Qwen2.5-1.5B-Instruct model when trained with Dr.GRPO on a dataset containing an equal mixture of (i): Degree-5-Path-2 mixed with Degree-10-Path-10, and (ii): Degree-2-Path-5 mixed with Degree-10-Path-10. The training rewards saturate at around $0.5$ in both cases, indicating that the model learns to solve only the easier examples in each mixture. (Right): Success rate on a held-out test set of Degree-10-Path-10 examples. Neither mixture helps the model solve the harder Degree-10-Path-10 task.
  • ...and 2 more figures
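
The task in Figure 1 and the equal-proportion data mixing can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, edges are assumed to point away from the center, and "path length" is assumed to count nodes including the center (matching the example path $4,2,7$ for Degree-3-Path-3).

```python
import random


def make_star_graph(degree, path_len, rng):
    """Build a star graph with `degree` arms leaving a center node.

    Assumption: each arm has `path_len` nodes counting the center,
    and node labels are shuffled so the center is not identifiable by id.
    Returns (adjacency list, source, destination).
    """
    arm_size = path_len - 1  # nodes per arm, excluding the center
    labels = list(range(1 + degree * arm_size))
    rng.shuffle(labels)
    center, rest = labels[0], labels[1:]
    adjacency = {node: [] for node in labels}
    leaves = []
    for arm in range(degree):
        prev = center
        for step in range(arm_size):
            node = rest[arm * arm_size + step]
            adjacency[prev].append(node)
            prev = node
        leaves.append(prev)
    # The source is the center; the destination is the end of one arm.
    return adjacency, center, rng.choice(leaves)


def outcome_reward(adjacency, source, destination, path):
    """Binary outcome reward: 1.0 only for a valid source-to-destination path."""
    if not path or path[0] != source or path[-1] != destination:
        return 0.0
    follows_edges = all(b in adjacency.get(a, []) for a, b in zip(path, path[1:]))
    return 1.0 if follows_edges else 0.0


def mix_equal(hard, easy, rng):
    """Combine hard and easy examples in equal proportion, then shuffle."""
    n = min(len(hard), len(easy))
    mixed = rng.sample(hard, n) + rng.sample(easy, n)
    rng.shuffle(mixed)
    return mixed
```

A training set would then be built by generating graphs at two difficulty levels and passing both lists to `mix_equal`; the RL algorithm and the outcome reward stay unchanged.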