RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning

Gang Li, Yulei Qin, Xiaoyu Tan, Dingkang Yang, Yuchen Shi, Zihan Xu, Xiang Li, Xing Sun, Ke Li

TL;DR

The paper tackles verbose reasoning in RLVR by introducing Rollout Response Recomposition (RoRecomp), a data-centric method that recomposes rollout outputs into two batch types to favor concise yet correct reasoning. By forming priority batches that emphasize short correct and long incorrect responses and compensatory replay-based batches to stabilize learning, RoRecomp provides clearer credit assignment without altering rewards. Across zero RL training, agentic RL, and thinking compression tasks, RoRecomp achieves substantial reductions in reasoning length and tool usage while maintaining competitive accuracy, demonstrating robustness across model scales and domains. This approach offers a practical, plug-and-play alternative to reward shaping for improving reasoning efficiency in LLM-based RL systems.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has proven effective in eliciting complex reasoning in large language models (LLMs). However, standard RLVR training often leads to excessively verbose reasoning (in reasoning tasks) and inefficient exploration trajectories (in agentic settings). Two factors contribute: outcome-only rewards provide no incentive for efficiency, and the high variance in response length within relatively small rollout groups results in noisy optimization signals. To address this, we propose Rollout Response Recomposition (RoRecomp), a plug-and-play method that guides models toward concise reasoning by strategically recomposing the training data. RoRecomp separates responses into two distinct batch types: 1) priority batches, which combine short-correct and long-incorrect responses selected from online batches to provide a clear gradient signal for brevity, and 2) compensation batches, which utilize remaining responses from a replay buffer to maintain stability and prevent model collapse. We evaluate RoRecomp across three settings, where results demonstrate substantial efficiency gains: reducing reasoning length by 27.7% in zero RL training, reducing unnecessary tool calls by 46.8% while improving accuracy in agentic RL, and achieving up to 52.5% length reduction in thinking compression, all with minimal performance impact.
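The recomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Response` type, the `keep_ratio` parameter, the rule of taking a fixed fraction of each group ranked by length, and the first-in-first-out `ReplayBuffer` are all assumptions introduced here for illustration. The abstract only states that priority batches combine short-correct and long-incorrect responses and that the remaining responses reach compensation batches through a replay buffer.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Response:
    """One rollout response (hypothetical container, not from the paper)."""
    question_id: int
    length: int      # number of generated tokens
    correct: bool    # verifiable outcome reward: True means correct


def recompose_group(
    group: List[Response], keep_ratio: float = 0.5
) -> Tuple[List[Response], List[Response]]:
    """Split one question's rollout group into (priority, remaining).

    Assumed rule: keep the shortest `keep_ratio` fraction of correct
    responses and the longest `keep_ratio` fraction of incorrect ones.
    """
    # Correct responses ranked shortest first.
    correct = sorted((r for r in group if r.correct), key=lambda r: r.length)
    # Incorrect responses ranked longest first.
    incorrect = sorted(
        (r for r in group if not r.correct), key=lambda r: r.length, reverse=True
    )
    n_correct = math.ceil(len(correct) * keep_ratio)
    n_incorrect = math.ceil(len(incorrect) * keep_ratio)
    priority = correct[:n_correct] + incorrect[:n_incorrect]
    remaining = correct[n_correct:] + incorrect[n_incorrect:]
    return priority, remaining


class ReplayBuffer:
    """FIFO store for responses left out of priority batches (assumed design)."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self._items: List[Response] = []

    def add(self, responses: List[Response]) -> None:
        self._items.extend(responses)

    def pop_batch(self) -> Optional[List[Response]]:
        """Return a compensation batch once enough responses have accumulated."""
        if len(self._items) < self.batch_size:
            return None
        batch = self._items[: self.batch_size]
        self._items = self._items[self.batch_size:]
        return batch
```

In a training loop under these assumptions, each online rollout group would pass through `recompose_group`, the priority responses would form the next policy update, and the remaining responses would be added to the buffer. A compensation update would run whenever `pop_batch` returns a full batch. The rewards themselves are left untouched; only the composition of each batch changes.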

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Comparison of RoRecomp and GRPO across three settings. (First row) Training dynamics demonstrate that RoRecomp significantly enhances reasoning efficiency by consistently reducing output length (Zero RL, Thinking Compression) or tool-use steps (Agentic RL). (Second row) This efficiency gain is achieved while maintaining comparable performance. Zero/Agentic RL training starts from Qwen2.5-7B; Thinking compression is trained on DeepSeek-R1-Distill-Qwen-7B.
  • Figure 2: (a) The overall framework of RoRecomp. After response generation, candidate responses are recomposed into two types of batches: priority batches and compensation batches. (b) The details of response selection. Prioritized responses are selected for each question by jointly considering response length and reward.
  • Figure 3: Dynamics of zero RL training.
  • Figure 4: Effectiveness of compensation batches, with response length and performance averaged across six math test sets and reported at various training steps.