ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models
Razvan-Gabriel Dumitru, Darius Peteleaza, Vikas Yadav, Liangming Pan
TL;DR
ConciseRL introduces a semantic, hyperparameter-free conciseness reward for RL-based reasoning in large language models. An LLM judge scores each reasoning trace, and that score serves as the reward that encourages correct yet concise reasoning. The method defines $R_c(y,x)=C(y)$ and $R_{ac}(y,x)=A(y,x)\times C(y)$, where $C(y)$ is the judge-assigned conciseness score of response $y$ and $A(y,x)$ is its accuracy on problem $x$. Each reward is optimized via PPO with a Leave-One-Out baseline, enabling adaptive control of reasoning length based on problem difficulty. Empirical results across GSM8K, MATH500, TheoremQA, GPQA-main, and MMLU-Pro-1k show substantial token reductions (up to 31× on easy problems and up to 3.6× on hard problems) with accuracy gains (e.g., up to +7% on MATH500 and +2.2% on TheoremQA). Judge quality significantly influences outcomes, with stronger judges yielding better efficiency–accuracy trade-offs. The approach is open-sourced and demonstrates that semantic conciseness signals can outperform static length penalties, improving interpretability and reducing inference costs while maintaining or enhancing performance.
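The two rewards and the Leave-One-Out baseline named in the TL;DR can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the conciseness score $C(y)$ is assumed to be already normalized to $[0, 1]$, the accuracy term $A(y,x)$ is assumed to be binary, and the baseline is assumed to be the mean reward of the other responses sampled for the same prompt.

```python
from typing import List


def conciseness_reward(c_score: float) -> float:
    """R_c(y, x) = C(y): the judge's conciseness score alone.

    Assumption: c_score has already been normalized to [0, 1].
    """
    return c_score


def accuracy_conciseness_reward(is_correct: bool, c_score: float) -> float:
    """R_ac(y, x) = A(y, x) * C(y).

    Assumption: A(y, x) is binary (1 if the final answer is correct, else 0),
    so an incorrect response receives zero reward however concise it is.
    """
    return float(is_correct) * c_score


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Subtract from each reward the mean reward of the *other* samples.

    Assumption: all rewards belong to responses sampled for the same prompt.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("Leave-One-Out needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Three hypothetical responses to one prompt: (is_correct, conciseness score).
samples = [(True, 1.0), (False, 0.75), (True, 0.5)]
rewards = [accuracy_conciseness_reward(ok, c) for ok, c in samples]
advantages = leave_one_out_advantages(rewards)
```

Under these assumptions, the multiplicative form means conciseness is only rewarded when the answer is correct, and the Leave-One-Out baseline centers each reward against its sibling samples without requiring a learned value function.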
Abstract
Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset: on simple problems, it reduces token usage by up to 31x while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/ConciseRL.
