Stable Reinforcement Learning for Efficient Reasoning
Muzhi Dai, Shixuan Liu, Qingyi Si
TL;DR
This work addresses the training instability that length-penalty rewards introduce into reinforcement learning for chain-of-thought (CoT) reasoning. It introduces GRPO-$\lambda$, a dynamic reward strategy that switches between efficiency-focused and accuracy-focused optimization by evaluating batch-wise group correctness and applying a top-$\lambda$ selection. Empirical results across GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 show a 1.48% improvement in average accuracy together with a 47.3% reduction in CoT length, while extending the number of viable training iterations by at least 2.5×. The approach offers a practical path to stable, efficient reasoning in post-training RL for LLMs and suggests design guidelines for balancing CoT quality and throughput.
Abstract
The success of DeepSeek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods such as GRPO. However, these rule-based 0/1 outcome reward methods cannot regulate the intermediate reasoning process during chain-of-thought (CoT) generation, which leads to severe overthinking. In response, recent studies have designed reward functions that reinforce shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as completion length decreases, model accuracy collapses abruptly, often early in training. To address this issue, we propose a simple yet effective solution, GRPO-$\lambda$, an efficient and stabilized variant of GRPO that dynamically adjusts the reward strategy by monitoring the correctness ratio among the completions within each query-sampled group. A low correctness ratio signals that a length penalty would compromise CoT quality, so training switches to length-agnostic 0/1 rewards that prioritize reasoning capability. A high correctness ratio keeps the length penalty in place to boost efficiency. Experimental results show that our approach avoids the training instability caused by length penalties while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
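The reward switching described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the linear penalty `1 - alpha * length / max_length`, the `alpha` parameter, and the tie-breaking among equally correct groups are all assumptions made for illustration. Only the overall logic comes from the text: groups with a high correctness ratio (here, the top-$\lambda$ fraction of the batch) receive length-penalized rewards, and the remaining groups fall back to plain 0/1 rewards.

```python
from typing import List, Sequence


def correctness_ratio(correct: Sequence[bool]) -> float:
    """Fraction of correct completions in one query-sampled group."""
    return sum(correct) / len(correct)


def select_top_lambda(ratios: Sequence[float], lam: float) -> List[bool]:
    """Flag the top-`lam` fraction of groups, ranked by correctness ratio.

    Ties keep their original batch order (an assumption of this sketch).
    """
    k = int(lam * len(ratios))
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(ratios))]


def group_rewards(
    correct: Sequence[bool],
    lengths: Sequence[int],
    use_length_penalty: bool,
    alpha: float = 0.5,  # hypothetical penalty strength
) -> List[float]:
    """Rewards for one group: 0/1, or length-penalized for correct answers."""
    if not use_length_penalty:
        # Length-agnostic 0/1 outcome reward (accuracy-focused mode).
        return [1.0 if c else 0.0 for c in correct]
    # Assumed penalty shape: longer correct completions earn less.
    max_len = max(lengths)
    return [
        1.0 - alpha * (n / max_len) if c else 0.0
        for c, n in zip(correct, lengths)
    ]


def batch_rewards(
    batch_correct: Sequence[Sequence[bool]],
    batch_lengths: Sequence[Sequence[int]],
    lam: float,
    alpha: float = 0.5,
) -> List[List[float]]:
    """Apply the per-group reward mode chosen by the top-`lam` selection."""
    ratios = [correctness_ratio(c) for c in batch_correct]
    flags = select_top_lambda(ratios, lam)
    return [
        group_rewards(c, n, flag, alpha)
        for c, n, flag in zip(batch_correct, batch_lengths, flags)
    ]
```

In this sketch, a batch with one mostly correct group and one mostly incorrect group at `lam=0.5` would penalize length only in the first group, leaving the second on 0/1 rewards so that its reasoning quality is not traded away for brevity.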
