TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu

TL;DR

TlDr introduces a dynamic thinking-length reweighting framework to compress LLM reasoning by adaptively balancing short CoT (System-1) and long CoT (System-2) data during post-training. By estimating upper bounds on efficiency and accuracy and updating data ratios in real time, TlDr achieves around 40% token reduction on DeepSeek-R1-Distill-7B/14B with little degradation in reasoning performance. The method outperforms static data mixtures and token-budgeted baselines, while requiring simpler data construction and no extensive problem-by-problem annotations. This approach offers a practical path toward efficient, scalable reasoning in large language models for diverse benchmarks and problem difficulties.

Abstract

Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning, especially during inference with extremely long outputs, has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels. Our method reduces the number of output tokens by nearly 40% while maintaining reasoning accuracy. Our code and data will be available soon.
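The abstract does not state the exact update rule, so the following is only a minimal sketch of what a dynamic re-weighting step between System-1 (short CoT) and System-2 (long CoT) data could look like. The function name `update_ratios`, the reference values `acc_ref` and `tokens_ref`, the step size `lr`, and the additive-then-normalize rule are all assumptions made for illustration, not the authors' implementation.

```python
def update_ratios(w_short, w_long, acc, tokens, acc_ref, tokens_ref, lr=0.5):
    """Return new (w_short, w_long) mixture weights that sum to 1.

    Hypothetical rule (not from the paper):
    - if validation accuracy falls below acc_ref, shift weight to long CoT;
    - if average output length exceeds tokens_ref, shift weight to short CoT.
    """
    # Relative accuracy shortfall against the reference accuracy (assumed bound).
    acc_gap = max(0.0, (acc_ref - acc) / acc_ref)
    # Relative length excess against the reference length (assumed bound).
    len_gap = max(0.0, (tokens - tokens_ref) / tokens_ref)

    # Additive update followed by normalization so the weights stay a ratio.
    w_short = w_short + lr * len_gap
    w_long = w_long + lr * acc_gap
    total = w_short + w_long
    return w_short / total, w_long / total


# Usage: outputs are too long but accuracy is on target,
# so the short-CoT share of the next training mixture grows.
w_short, w_long = update_ratios(
    0.5, 0.5, acc=0.9, tokens=2000, acc_ref=0.9, tokens_ref=1000
)
print(w_short, w_long)
```

In a training loop, a step like this would be called every few optimization steps with accuracy and average token length measured on a validation set, and the returned weights would set the sampling proportions of short-CoT and long-CoT examples for the next interval.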

Paper Structure

This paper contains 31 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Impact of Combining Short CoT and Long CoT in Fixed Ratios on Thinking Compression Performance and Token Cost. We assess the decay rate of output token length and accuracy on datasets of varying question difficulty, from GSM8K to AIME. For details of the Normalized Token/Acc metrics, see Eq. \ref{equ:normlized_acc} and Eq. \ref{equ:normlized_token}.
  • Figure 2: Overview of TlDr: Starting with a System-2 model, we iteratively update it on both Short-CoT and Long-CoT samples. The ratios of both data sources are adjusted every several steps based on the current average model accuracy and token length from the validation set until convergence.
  • Figure 3: Comparison of accuracy and generation length between Vanilla CoT and our TlDr method on four benchmark datasets (GSM8K, MATH500, AIME, AMC) using DeepSeek-R1-Distill-Qwen models. TlDr consistently reduces generation length while maintaining or improving accuracy across both 7B and 14B model scales.
  • Figure 4: Frequency comparison of different keywords. The figure illustrates the distribution of exploratory, checking, and reflective keywords across datasets. Exploratory keyword: wait; reflective keyword: but; checking keywords: make sure/confirm/verify/check. TlDr significantly reduces the presence of such words, reflecting its ability to produce streamlined and efficient reasoning steps.
  • Figure 5: Evaluation Prompt for GSM8K, MATH500, AIME24, AMC, MinervaMath
  • ...and 3 more figures