Efficient Reasoning via Reward Model
Yuhao Wang, Xiaopeng Li, Cheng Gong, Ziru Liu, Suiyun Zhang, Rui Liu, Xiangyu Zhao
TL;DR
This work addresses the inefficiency of verbose reasoning in large reasoning models (LRMs) by identifying two failure modes, length collapse and training collapse, that arise when length penalties are added to the reward in RLVR. It introduces a two-component framework: a Conciseness Reward Model (CRM) that scores the conciseness of a reasoning path, and a Conciseness Reward Function (CRF) that makes the conciseness score explicitly dependent on the outcome reward, which reduces variance and speeds up convergence. The approach improves both accuracy and token efficiency on five mathematical benchmarks and several backbones (e.g., +8.1% accuracy and -19.9% response tokens on Qwen2.5-7B) and generalizes to Llama and Mistral. The work includes theoretical analysis of variance reduction and convergence, and the code and datasets are publicly available for reproduction.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs), enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking, which substantially increases computational cost. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find that they frequently suffer from two critical issues, length collapse and training collapse, which result in suboptimal performance. To address these issues, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of a reasoning path. We also introduce a novel reward formulation, the Conciseness Reward Function (CRF), with an explicit dependency between the outcome reward and the conciseness score, thereby fostering reasoning that is both more effective and more efficient. On the theoretical side, we demonstrate the superiority of the new reward in terms of variance reduction and improved convergence properties. On the practical side, extensive experiments on five mathematical benchmark datasets demonstrate the method's effectiveness and token efficiency: it achieves an 8.1% accuracy improvement and a 19.9% reduction in response token length on Qwen2.5-7B. Furthermore, the method generalizes well to other LLMs, including Llama and Mistral. The implementation code and datasets are publicly available for reproduction: https://anonymous.4open.science/r/CRM.
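
To make the idea of an explicit dependency between the outcome reward and the conciseness score concrete, the following Python sketch contrasts an additive length penalty with a reward in which the conciseness bonus is gated by correctness. It is only an illustration: the abstract does not give the CRF formula, so the multiplicative gating, the function names, and the parameters `lam` and `alpha` are all assumptions made here and are not taken from the paper.

```python
def length_penalty_reward(correct: bool, num_tokens: int, max_tokens: int,
                          lam: float = 0.5) -> float:
    """Additive length penalty (illustrative baseline, not from the paper).

    The penalty is applied whether or not the answer is correct, so a short
    wrong answer scores higher than a long wrong answer.
    """
    outcome = 1.0 if correct else 0.0
    length_ratio = min(num_tokens / max_tokens, 1.0)
    return outcome - lam * length_ratio


def gated_conciseness_reward(correct: bool, conciseness: float,
                             alpha: float = 0.5) -> float:
    """Conciseness bonus that depends on the outcome reward (assumed form).

    `conciseness` stands in for a reward-model score in [0, 1]. Because the
    bonus is multiplied by the outcome, an incorrect response earns nothing
    for being short.
    """
    if not 0.0 <= conciseness <= 1.0:
        raise ValueError("conciseness must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome * (1.0 + alpha * conciseness)


if __name__ == "__main__":
    # Wrong answers: the additive penalty still prefers the shorter one.
    print(length_penalty_reward(False, 100, 1000))
    print(length_penalty_reward(False, 1000, 1000))
    # Wrong answers: the gated reward is zero regardless of conciseness.
    print(gated_conciseness_reward(False, 0.9))
    # Correct answers: conciseness adds a bonus on top of the outcome reward.
    print(gated_conciseness_reward(True, 0.9))
```

Under this assumed gating, the policy cannot raise its reward by shortening incorrect responses, which is one plausible way a coupled reward could avoid the pressure toward ever-shorter outputs that an unconditional length penalty creates.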
