Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

Abdelaziz Bounhar, Hadi Abdine, Evan Dufraisse, Ahmad Chamma, Amr Mohamed, Dani Bouch, Michalis Vazirgiannis, Guokan Shang

TL;DR

This paper tackles excessive verbosity in the step-by-step reasoning of RLVR-trained LLMs. It shows that retaining moderately easy problems acts as an implicit length regularizer, producing emergent brevity without explicit length penalties. The authors use GRPO in a two-stage curriculum RLVR setup on math data, combining verifiable rewards with data curation to maintain accuracy while substantially reducing output length on reasoning tasks. The approach achieves competitive or improved efficiency on math benchmarks, suggesting that careful data curation can reconcile concision with strong reasoning performance under limited generation budgets.

Abstract

Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available on GitHub (https://github.com/MBZUAI-Paris/Frugal-AI), with datasets and models on Hugging Face (https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc).
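
The mechanism described in the abstract can be illustrated with a small Python sketch. It assumes binary verifiable rewards and the usual GRPO group-normalized advantage, under which a problem whose rollouts are all correct (or all wrong) contributes no learning signal. The weighting rule, its thresholds (`hard_max`, `easy_max`), and the boost factor `easy_boost` are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def success_rate(rewards):
    """Empirical success rate of one problem from its binary rollout rewards."""
    return sum(rewards) / len(rewards)


def sampling_weight(p, hard_max=0.5, easy_max=0.9, easy_boost=2.0):
    """Hypothetical curation rule (illustrative thresholds, not the paper's).

    - p == 0 or p == 1: weight 0, since every advantage in the group is zero.
    - hard_max < p <= easy_max: "moderately easy", modestly up-weighted.
    - everything else: weight 1.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    if hard_max < p <= easy_max:
        return easy_boost
    return 1.0


# Toy pool: problem id -> binary rewards from 8 rollouts (made-up data).
pool = {
    "always_solved": [1.0] * 8,
    "moderately_easy": [1.0] * 6 + [0.0] * 2,
    "hard": [1.0] * 1 + [0.0] * 7,
    "never_solved": [0.0] * 8,
}

weights = {name: sampling_weight(success_rate(r)) for name, r in pool.items()}

# Draw a training batch in proportion to the curation weights.
rng = random.Random(0)
names = list(weights)
batch = rng.choices(names, weights=[weights[n] for n in names], k=16)
```

In this toy pool, the always-solved and never-solved problems receive weight 0 and never enter the batch, while the moderately easy problem is drawn about twice as often as the hard one. Keeping such short-chain, mostly solvable problems in the mix is the kind of exposure the abstract credits with regularizing output length.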

Paper Structure

This paper contains 23 sections, 13 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Empirical success-rate analysis.
  • Figure 2: Training dynamics during Stage 1 (emergent brevity). Early training is dominated by overly long, truncated generations with high entropy and low accuracy. As learning progresses, average response length and clip ratio decrease sharply, entropy stabilizes, and validation accuracy on AIME25 improves steadily—showing that conciseness and correctness co-emerge.
  • Figure 3: Scaling behavior under varying generation budgets (8k → 16k → 32k → 42k). The top panels show Pass@1 accuracy and the bottom panels show Efficiency-Adjusted Accuracy for the three benchmarks: AIME25, GSM Plus, and Omni-Hard.
  • Figure 4: Distribution $\rho(p)$ after scaling maximum response length to 42k tokens.

Theorems & Definitions (2)

  • Remark 1
  • Definition 1: Efficiency Adjusted Accuracy (EAA)