Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives

Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang

TL;DR

This work tackles signal loss in RL-based LLM reasoning caused by undersampling, and introduces a principled non-linear objective framework that weights prompts by difficulty. It then develops Reinforce-Ada, with estimation-based (Est) and sequential (Seq) realizations, which adaptively allocates inference budget to harder prompts, recovering learning signals without resorting to uniformly large group sizes. Across multiple backbones and math benchmarks, Reinforce-Ada accelerates convergence (up to ~2x) at the same total budget, and improves the reward-entropy trade-off by preserving policy diversity. The approach provides a scalable, plug-and-play alternative to passive filtering, with theoretical grounding and consistent empirical gains. This framework could broadly improve data efficiency in online RL for LLMs and related non-linear objective settings.

Abstract

Reinforcement learning (RL) for large language model reasoning is frequently hindered by signal loss, a phenomenon where standard uniform sampling with small group sizes fails to uncover informative learning signals for difficult prompts. We demonstrate that this collapse is a statistical artifact of undersampling rather than an inherent model limitation. To address this systematically, we introduce a theoretical framework based on optimizing a non-linear RL objective (e.g., log-likelihood). We show that this objective naturally induces a weighted gradient estimator that prioritizes difficult prompts, which can be robustly realized through adaptive sampling. Guided by this framework, we propose Reinforce-Ada, a family of algorithms that dynamically allocates inference budgets based on prompt difficulty, effectively scaling up RL compute to where it is needed most. Unlike passive filtering methods that discard low-signal prompts, Reinforce-Ada actively invests compute to recover them. We introduce two efficient realizations: an estimation-based approach and a model-free sequential sampling approach. Extensive experiments across multiple benchmarks show that Reinforce-Ada significantly outperforms uniform baselines like GRPO, recovering lost signals and accelerating convergence by up to $2\times$ while maintaining the same total inference budget. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
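
To make the idea of allocating more samples to harder prompts concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes binary rewards drawn i.i.d. from a hidden per-prompt pass rate, and it uses a hypothetical stopping rule (a prompt is deactivated once both a correct and an incorrect response have been observed). The function name `sequential_sample` and the parameters `per_round` and `max_rounds` are illustrative choices and do not come from the paper or its code. Any downsampling of the collected responses to a fixed group size is omitted.

```python
import random


def sequential_sample(pass_rates, per_round=4, max_rounds=8, seed=0):
    """Simulate multi-round sampling with a hypothetical stopping rule.

    pass_rates: hidden success probability per prompt (simulation assumption).
    Returns (rewards, active): the 0/1 rewards collected per prompt, and the
    set of prompt indices that never met the stopping rule.
    """
    rng = random.Random(seed)
    rewards = {i: [] for i in range(len(pass_rates))}
    active = set(rewards)

    for _ in range(max_rounds):
        if not active:
            break
        # Only prompts that are still active receive new samples this round.
        for i in list(active):
            rewards[i].extend(
                1 if rng.random() < pass_rates[i] else 0
                for _ in range(per_round)
            )
            n_correct = sum(rewards[i])
            # Hypothetical rule: stop once both outcomes have been observed,
            # i.e. the group is no longer all-correct or all-incorrect.
            if 0 < n_correct < len(rewards[i]):
                active.discard(i)

    return rewards, active


if __name__ == "__main__":
    rewards, active = sequential_sample([0.02, 0.5, 0.9, 0.0])
    for i, r in rewards.items():
        print(f"prompt {i}: {len(r)} samples, {sum(r)} correct, active={i in active}")
```

In this sketch, prompts with a pass rate near 0 or 1 tend to consume more rounds than those near 0.5, which is the qualitative behavior the abstract describes: compute is spent where a uniform small group would most likely return no signal.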

Paper Structure

This paper contains 43 sections, 17 equations, 10 figures, 5 tables, 2 algorithms.

Figures (10)

  • Figure 1: Plug-and-play usage. Left: a direct replacement of the generation API in verl (generate_sequences $\rightarrow$ generate_multi_round_adaptive_downsampling). Right: with no other changes, Reinforce-Ada attains faster reward growth and a higher asymptote than GRPO.
  • Figure 2: Pass@k curves (left) and the ratio of prompts with all-correct responses (right) for two models on a subset of the Open-R1 prompt set. The models tested are the Qwen2.5-Math-1.5B base model and an intermediate checkpoint from its RL training. The percentage of prompts yielding all-correct/all-incorrect responses is high for small $k$ but drops significantly as $k$ increases. This suggests that signal loss is often a statistical artifact of small sample groups.
  • Figure 3: Visualization of different $f(t)$ and $f'(t)$. The concave functions $\log (t)$ and $\sqrt{t}$ assign larger weights $f'(t)$ to difficult prompts ($t\rightarrow0$).
  • Figure 4: Training reward vs. steps for GRPO and Reinforce-Ada across four backbones: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Llama-3.2-3B-it, and Qwen3-4B. Curves are smoothed with a 20-step moving average. In all cases, Reinforce-Ada learns faster and reaches a higher reward than GRPO, with the Balance variant typically achieving the highest asymptote.
  • Figure 5: First row: Sampling dynamics of different training strategies using the Qwen2.5-Math-1.5B model. We omit Reinforce-Ada-Est since its sampling cost matches that of GRPO-8. Second row: Sampling dynamics with the Qwen2.5-Math-1.5B model. Left: additional samples generated in later rounds compared to standard GRPO. Middle: number of prompts that remain active after multi-round adaptive sampling with the Reinforce-Ada-Seq-balance variant. Right: number of prompts that satisfy the stopping criteria within the first two rounds with the Reinforce-Ada-Seq-balance variant. All curves are smoothed using a moving average with a window size of $20$.
  • ...and 5 more figures
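
The Figure 2 observation, that all-correct or all-incorrect groups are common for small $k$ and rare for large $k$, can be illustrated with a short calculation. The sketch below assumes each response to a prompt is an independent binary outcome with a fixed pass rate `p`; this is a simplifying assumption for illustration, and the helper name `degenerate_group_prob` is hypothetical.

```python
def degenerate_group_prob(p: float, k: int) -> tuple[float, float]:
    """Return (P[all k incorrect], P[all k correct]) for pass rate p.

    Assumes i.i.d. binary rewards, which is an illustrative simplification.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (1.0 - p) ** k, p ** k


if __name__ == "__main__":
    # A hard prompt with a 5% pass rate, sampled at several group sizes.
    for k in (4, 8, 32, 128):
        all_wrong, all_right = degenerate_group_prob(0.05, k)
        print(f"k={k:>3}: P(all incorrect)={all_wrong:.4f}, P(all correct)={all_right:.2e}")
```

Under this assumption, a prompt with a 5% pass rate yields an all-incorrect group with probability $0.95^4 \approx 0.81$ at $k=4$, and that probability decays geometrically as $k$ grows.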

Theorems & Definitions (4)

  • Example 1: Log objective $f(t)=\log(t)$
  • Example 2: Power function $f(t)=t^\alpha$, $\alpha>0$
  • Remark 1: Recovering the GRPO Advantage
  • Remark 2: The Necessity of the Log-Objective
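
The link between the two examples above and the weights $f'(t)$ shown in Figure 3 follows from the chain rule. The block below is a sketch under assumed notation: $p_\theta(x)$ denotes the pass rate of the policy on prompt $x$ and $J_f$ the non-linear objective; these symbols are introduced here for illustration and may differ from the paper's own.

```latex
% Assumed notation: p_\theta(x) is the pass rate on prompt x, f is the outer function.
J_f(\theta) = \mathbb{E}_{x}\big[ f\big(p_\theta(x)\big) \big]
\quad\Longrightarrow\quad
\nabla_\theta J_f(\theta) = \mathbb{E}_{x}\big[ f'\big(p_\theta(x)\big)\, \nabla_\theta p_\theta(x) \big].

% Per-prompt weights f'(t) for the functions discussed:
f(t) = t            \;\Rightarrow\; f'(t) = 1, \qquad
f(t) = \log(t)      \;\Rightarrow\; f'(t) = \tfrac{1}{t}, \qquad
f(t) = t^{\alpha}   \;\Rightarrow\; f'(t) = \alpha\, t^{\alpha - 1}.
```

The linear case weights every prompt equally, whereas $\log(t)$ and $t^\alpha$ with $0<\alpha<1$ (for instance $\sqrt{t}$, where $f'(t) = 1/(2\sqrt{t})$) give weights that grow without bound as $t\rightarrow 0$, which is the sense in which concave objectives prioritize difficult prompts.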