Optimizing Length Compression in Large Reasoning Models

Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou

TL;DR

This work tackles the inefficiency of verbose, nonessential reasoning in Large Reasoning Models by identifying invalid thinking patterns that persist after reaching a correct answer. It introduces Brevity and Sufficiency as principled guides and presents LC-R1, a GRPO-based post-training method that optimizes for overall conciseness (Length Reward) and targeted removal of redundancy (Compress Reward). Across multiple benchmarks and backbones, LC-R1 achieves substantial sequence-length reductions (around 46%) with only a small accuracy drop (~2%), while maintaining high Valid Thinking rates and robust generalization. The results offer a practical pathway to deploy more compute-efficient LRMs without sacrificing core reasoning capabilities.

Abstract

Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as "invalid thinking" -- models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
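
To make the GRPO-based setup more concrete, here is a minimal sketch of how group-relative advantages could be computed from a length-aware reward. It is not the authors' implementation. The reward shape in `length_reward`, the function names, and the sample lengths are all assumptions made for illustration, and the Compress Reward is omitted entirely. The only part taken from standard GRPO is normalizing each reward against the mean and standard deviation of its sampling group.

```python
from statistics import mean, pstdev


def length_reward(length, max_length, correct):
    """Hypothetical length reward (an assumption, not the paper's formula).

    Incorrect outputs earn nothing; correct outputs earn more the shorter
    they are relative to the longest output in the group.
    """
    if not correct:
        return 0.0
    return 1.0 - length / max_length


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group, as GRPO does."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of four sampled outputs for one prompt (made-up values).
lengths = [400, 800, 1200, 1000]
correct = [True, True, True, False]

max_len = max(lengths)
rewards = [length_reward(n, max_len, c) for n, c in zip(lengths, correct)]
advantages = group_relative_advantages(rewards)

for n, c, a in zip(lengths, correct, advantages):
    print(f"length={n:5d} correct={c!s:5} advantage={a:+.3f}")
```

Under these assumptions the shortest correct output receives the largest advantage, so the policy update favors concise reasoning that still reaches the right answer.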

Paper Structure

This paper contains 39 sections, 11 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison between an inefficient reasoning model and an efficient one. The former tends to run a verbose self-check after it has already derived the correct answer to the given question. The model trained with LC-R1 reaches the correct answer through a more efficient reasoning process, without any invalid thinking.
  • Figure 2: Pareto analysis of the Efficacy-Efficiency trade-off of different methods on two reasoning models. The x-axis shows the change in reasoning length and the y-axis the change in accuracy, both relative to the original model (per the paper's metric equation), with the top-left corner representing the ideal position. A smaller and darker marker indicates a higher Valid Thinking (VT) rate (per the paper's Valid Thinking equation), signifying a more efficient thinking process. Compared to other methods also on the Pareto frontier, LC-R1 achieves a more favorable trade-off, attaining a substantially higher compression rate at the cost of a minimal drop in accuracy, while also achieving a higher VT rate. The sub-optimal performance of our ablation variants (w/o C-reward, w/o L-reward) further demonstrates the importance of our dual-reward design.
  • Figure 3: An overview of the three-stage LC-R1 training pipeline. (1) Valid Segment Extraction: First, an extractor model processes the original reasoning traces to identify the valid thinking portion and generate compressed sequences. (2) Reward Calculation: Next, these compressed sequences are used to compute our dual rewards, the Length Reward and the Compress Reward, with the latter applied exclusively as a bonus or penalty on the final </think> token. These are then combined to calculate the final advantages. (3) Policy Optimization: Finally, the GRPO loss is calculated using the compressed sequences and corresponding advantages, steering the model toward more concise and efficient reasoning.
  • Figure 4: Impact of the LC-R1 compression method on the AIME25 benchmark. Left: The Pass@k scores show that LC-R1 models maintain competitive performance compared to the originals, preserving the model's potential. Right: Per-problem analysis on DeepSeek-R1-Distill-Qwen-7B reveals that LC-R1 achieves similar Pass@1 accuracy while maintaining a consistent token compression ratio across problems of varying difficulty, demonstrating a universal compression effect.
  • Figure 5: Our prompt for extraction of answer prefix.
  • ...and 3 more figures
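
The captions above rely on two kinds of quantities: changes measured relative to the original model (Figure 2) and Pass@k scores (Figure 4). The sketch below shows one plausible way to compute each. Both are assumptions rather than the paper's definitions: `relative_change` reads "change relative to the original model" as a simple ratio, and `pass_at_k` uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k). The paper's own equations may differ, and every number used here is made up for illustration.

```python
from math import comb


def relative_change(new_value, base_value):
    """Change of a metric relative to the original model (assumed form)."""
    return (new_value - base_value) / base_value


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct.

    Assumes this common estimator; the paper may compute Pass@k differently.
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up token counts: 1000 tokens on average before, 540 after compression.
length_change = relative_change(540, 1000)
print(f"relative length change: {length_change:+.2f}")

# Made-up sampling result: 10 samples per problem, 3 of them correct.
for k in (1, 2, 5):
    print(f"pass@{k}: {pass_at_k(10, 3, k):.3f}")
```

With this reading, a negative length change together with an accuracy change close to zero places a method near the top-left corner of the Pareto plot.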