Efficient Reasoning via Reward Model

Yuhao Wang, Xiaopeng Li, Cheng Gong, Ziru Liu, Suiyun Zhang, Rui Liu, Xiangyu Zhao

TL;DR

This work addresses the inefficiency of verbose reasoning in large reasoning models (LRMs). It identifies two failure modes, length collapse and training collapse, that arise when length penalties are added to the reward in RLVR. It introduces a two-component framework: a Conciseness Reward Model (CRM) that scores the conciseness of a reasoning path, and a Conciseness Reward Function (CRF) that ties the conciseness score to the outcome reward, enabling variance reduction and faster convergence. The approach improves both accuracy and token efficiency across five mathematical benchmarks and several backbones (e.g., +8.1% accuracy and -19.9% tokens on Qwen2.5-7B), and it generalizes to Llama and Mistral. The work includes theoretical proofs of variance reduction and convergence, and the code is publicly available for reproduction.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs), enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking, which substantially increases computational costs. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find that they frequently suffer from two critical issues, length collapse and training collapse, resulting in sub-optimal performance. To address them, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of a reasoning path. Additionally, we introduce a novel reward formulation, the Conciseness Reward Function (CRF), with an explicit dependency between the outcome reward and the conciseness score, thereby fostering both more effective and more efficient reasoning. From a theoretical standpoint, we demonstrate the superiority of the new reward in terms of variance reduction and improved convergence properties. On the practical side, extensive experiments on five mathematical benchmark datasets demonstrate the method's effectiveness and token efficiency: it achieves an 8.1% accuracy improvement and a 19.9% reduction in response token length on Qwen2.5-7B. Furthermore, the method generalizes well to other LLMs, including Llama and Mistral. The implementation code and datasets are publicly available for reproduction: https://anonymous.4open.science/r/CRM.
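
To make the contrast drawn in the abstract concrete, here is a minimal sketch in Python. It is not the paper's formulation, which this summary does not state. It compares an additive length penalty with a reward in which the conciseness score counts only when the outcome reward is positive. The function names, the multiplicative gating form, the role of the weight `alpha`, the use of a population standard deviation in the group normalization, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def length_penalty_reward(correct, n_tokens, max_tokens, alpha=0.5):
    """Illustrative baseline: outcome reward minus a penalty proportional to length."""
    outcome = 1.0 if correct else 0.0
    return outcome - alpha * (n_tokens / max_tokens)


def gated_conciseness_reward(correct, conciseness, alpha=0.5):
    """Hypothetical gated form (an assumption, not the paper's CRF):
    a conciseness score in [0, 1] adds reward only when the answer is correct."""
    outcome = 1.0 if correct else 0.0
    return outcome * (1.0 + alpha * conciseness)


def group_advantages(rewards):
    """GRPO-style group normalization, simplified: (r - mean) / std within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All responses scored the same, so no response is preferred.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    # Toy group of sampled responses: (correct?, token count, conciseness score).
    group = [(True, 800, 0.3), (True, 300, 0.9), (False, 50, 0.95), (False, 900, 0.1)]

    penalized = [length_penalty_reward(c, n, 1000) for c, n, _ in group]
    gated = [gated_conciseness_reward(c, s) for c, _, s in group]

    print("length penalty advantages:", group_advantages(penalized))
    print("gated reward advantages:  ", group_advantages(gated))
```

Under the additive penalty in this sketch, a very short wrong answer still outscores a long wrong answer, so brevity is rewarded regardless of correctness. Under the gated form, incorrect responses gain nothing from being short. Whether this is the mechanism behind the length collapse the authors report is not established by this summary; the paper's own analysis is the authority there.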

Paper Structure

This paper contains 28 sections, 24 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The (a) length collapse and (b) training collapse phenomena. '#Tok' denotes the number of output tokens.
  • Figure 2: An overview of the proposed framework, taking GRPO as an example.
  • Figure 3: Training curves of the outcome reward and the average number of tokens in the generated reasoning path on (a) Qwen2.5-7B, (b) Llama3.1-8B, and (c) Mistral-7B-v0.1.
  • Figure 4: (a) Ablation study, where the y-axis denotes the outcome reward on the validation set. (b) Hyper-parameter analysis on $\alpha$, where the left and right y-axes denote Pass@1 and the number of tokens, respectively.
  • Figure 5: Prompt template used to evaluate conciseness of thinking process.
  • ...and 1 more figure