Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives
Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
TL;DR
This work tackles signal loss in RL-based LLM reasoning, which it attributes to undersampling, and introduces a principled non-linear objective framework that weights prompts by difficulty. It then develops Reinforce-Ada, with two realizations (Est, estimation-based, and Seq, sequential sampling), which adaptively allocates inference budget to harder prompts and recovers learning signals without resorting to uniformly large group sizes. Across multiple backbones and math benchmarks, Reinforce-Ada accelerates convergence by up to about 2x under the same total budget, and improves the reward-entropy trade-off by preserving policy diversity. The approach is a scalable, plug-and-play alternative to passive filtering, grounded in the non-linear objective framework and supported by the reported empirical gains. The framework could also improve data efficiency in online RL for LLMs and in related non-linear objective settings.
Abstract
Reinforcement learning (RL) for large language model reasoning is frequently hindered by signal loss, a phenomenon where standard uniform sampling with small group sizes fails to uncover informative learning signals for difficult prompts. We demonstrate that this signal loss is a statistical artifact of undersampling rather than an inherent model limitation. To address this systematically, we introduce a theoretical framework based on optimizing a non-linear RL objective (e.g., log-likelihood). We show that this objective naturally induces a weighted gradient estimator that prioritizes difficult prompts, which can be robustly realized through adaptive sampling. Guided by this framework, we propose Reinforce-Ada, a family of algorithms that dynamically allocates inference budgets based on prompt difficulty, effectively scaling up RL compute where it is needed most. Unlike passive filtering methods that discard low-signal prompts, Reinforce-Ada actively invests compute to recover them. We introduce two efficient realizations: an estimation-based approach and a model-free sequential sampling approach. Extensive experiments across multiple benchmarks show that Reinforce-Ada significantly outperforms uniform baselines such as GRPO, recovering lost signals and accelerating convergence by up to $2\times$ while maintaining the same total inference budget. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
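To make the undersampling argument concrete, the following is a minimal sketch, not the paper's implementation. It assumes binary rewards and a group-normalized advantage (as in GRPO), so a group in which every sampled response receives the same reward contributes no gradient. The function names `signal_loss_probability` and `sequential_sample`, and the stopping rule (stop once both a correct and an incorrect response have been seen, or a per-prompt cap is reached), are hypothetical illustrations of sequential sampling rather than the Est or Seq algorithms themselves.

```python
import random


def signal_loss_probability(pass_rate: float, group_size: int) -> float:
    """Probability that `group_size` i.i.d. binary rewards are all identical.

    With pass rate p and group size n this is p**n + (1 - p)**n. Under the
    stated assumption, such a group carries no learning signal.
    """
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size


def sequential_sample(pass_rate: float, max_samples: int, rng: random.Random) -> list:
    """Hypothetical stopping rule: keep drawing binary rewards for one prompt
    until both outcomes (0 and 1) have appeared or `max_samples` is reached."""
    rewards = []
    while len(rewards) < max_samples:
        rewards.append(1 if rng.random() < pass_rate else 0)
        # A mixed group has at least one success and at least one failure.
        if 0 < sum(rewards) < len(rewards):
            break
    return rewards


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.5, 0.1, 0.02):
        small = signal_loss_probability(p, 4)
        large = signal_loss_probability(p, 32)
        used = len(sequential_sample(p, 32, rng))
        print(f"pass_rate={p}: P(no signal) n=4 -> {small:.3f}, "
              f"n=32 -> {large:.3f}, samples used by sketch: {used}")
```

Under these assumptions, easy-to-resolve prompts stop after a few samples while hard prompts consume more of the budget, which is the intuition behind allocating compute by difficulty instead of using one large group size for every prompt.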
