ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models

Razvan-Gabriel Dumitru, Darius Peteleaza, Vikas Yadav, Liangming Pan

TL;DR

ConciseRL introduces a semantic, hyperparameter-free conciseness reward for RL-based reasoning in large language models, evaluated by an LLM judge to encourage correct yet concise reasoning traces. The method defines $R_c(y,x)=C(y)$ and $R_{ac}(y,x)=A(y,x)\times C(y)$, optimized via PPO with a Leave-One-Out baseline, enabling adaptive control of reasoning length based on problem difficulty. Empirical results across GSM8K, MATH500, TheoremQA, GPQA-main, and MMLU-Pro-1k show substantial token reductions (up to 31× on easy problems and up to 3.6× on hard problems) with accuracy gains (e.g., up to +7% on MATH500 and +2.2% on TheoremQA); the judge model quality significantly influences outcomes, with stronger judges yielding better efficiency–accuracy trade-offs. The approach is open-sourced and demonstrates that semantic conciseness signals can outperform static length penalties, improving interpretability and reducing inference costs while maintaining or enhancing performance.
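The two reward definitions and the Leave-One-Out baseline can be illustrated with a minimal sketch. It assumes that the judge's conciseness score $C(y)$ is already available as a number in $[0, 1]$, that $A(y,x)$ is a 0/1 correctness indicator, and that the baseline for each sampled trace is the mean reward of the other traces drawn for the same prompt. The function names are hypothetical and are not taken from the ConciseRL repository; the PPO update itself is omitted.

```python
from typing import List


def conciseness_reward(conciseness: float) -> float:
    """R_c(y, x) = C(y): the judge's conciseness score alone."""
    return conciseness


def accuracy_gated_reward(correct: bool, conciseness: float) -> float:
    """R_ac(y, x) = A(y, x) * C(y): conciseness counts only if the answer is correct.

    Assumption: A(y, x) is 1 for a correct answer and 0 otherwise.
    """
    return float(correct) * conciseness


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Subtract from each reward the mean reward of the other samples.

    Assumption: all rewards belong to traces sampled for the same prompt.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("a leave-one-out baseline needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Three traces for one prompt: two correct, one incorrect (made-up scores).
correct_flags = [True, True, False]
judge_scores = [0.9, 0.6, 0.2]  # hypothetical judge outputs, assumed in [0, 1]

rewards = [accuracy_gated_reward(a, c) for a, c in zip(correct_flags, judge_scores)]
advantages = leave_one_out_advantages(rewards)
print(rewards)
print(advantages)  # approximately [0.6, 0.15, -0.75]
```

In this toy case the incorrect trace gets zero reward regardless of its conciseness score, and the advantages sum to zero, so the concise correct trace is pushed up relative to its siblings rather than against a fixed length target.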

Abstract

Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/ConciseRL.

Paper Structure

This paper contains 30 sections, 5 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: MATH500 histogram by difficulty level. We report both accuracy (blue, left axis) and average token length (green, right axis) for each method. All methods are based on DeepSeek-R1-Distill-Qwen-1.5B. For our method ("ConciseRL" and "ConciseRL (Separated)"), we use GPT-4.1 mini as the judge. The exact values shown in the histogram are reported in Table \ref{tab:math_difficulty_levels}.
  • Figure 2: Given an input prompt, an LLM generates multiple reasoning traces, each of which is scored for conciseness by an LLM-based judge. Trace 1 is concise and receives the highest reward, Trace 2 has the same length (24 tokens) but is less concise, and Trace 3 is the longest (71 tokens) and least concise. These rewards then guide a policy gradient update.
  • Figure 3: Training metrics across steps using DeepSeek-R1-Distill-Qwen-1.5B as the base model. The Y-axes show accuracy (higher is better) and response length in tokens (lower is better).
  • Figure 4: Training metrics across steps using different models as the judge. The Y-axes show reward values and conciseness scores assigned by the judge.
  • Figure 5: Training metrics across steps using DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025) as the base model and different models as the judge. The Y-axes show accuracy (higher is better; left) and response length in tokens (lower is better; right). The X-axis in both cases shows the training step.
  • ...and 13 more figures