ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy

Gengyang Li, Yifeng Gao, Yuming Li, Yunfang Wu

TL;DR

ThinkLess introduces a training-free, inference-efficient framework that reduces chain-of-thought reasoning overhead in LLMs by terminating reasoning early and relying on a lightweight post-regulation step to maintain output quality. Attention analyses show that final answers rely minimally on earlier reasoning steps and instead focus on the reasoning terminator, supporting early termination. The method requires no model fine-tuning or extra data and achieves accuracy comparable to full CoT decoding while substantially reducing decoding time and KV-cache usage. Across several backbones and benchmarks, ThinkLess demonstrates strong efficiency-accuracy trade-offs, enabling practical deployment of CoT-like reasoning in real-world systems.

Abstract

While Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), the excessive length of reasoning tokens increases latency and KV cache memory usage, and may even truncate final answers under context limits. We propose ThinkLess, an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model. Attention analysis reveals that answer tokens focus minimally on earlier reasoning steps and primarily attend to the reasoning terminator token, due to information migration under causal masking. Building on this insight, ThinkLess inserts the terminator token at earlier positions to skip redundant reasoning while preserving the underlying knowledge transfer. To prevent format disruption caused by early termination, ThinkLess employs a lightweight post-regulation mechanism, relying on the model's natural instruction-following ability to produce well-structured answers. Without fine-tuning or auxiliary data, ThinkLess achieves accuracy comparable to full-length CoT decoding while greatly reducing decoding time and memory consumption.
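
The two steps described in the abstract (closing the reasoning span early, then nudging the answer format) can be illustrated with a minimal prompt-construction sketch. It is not the authors' implementation. The tag strings follow the figure captions, but the instruction wording, its placement after the terminator, the helper names, and the use of whitespace-separated words in place of model tokens are all assumptions made for illustration.

```python
THINK_OPEN = "<think>"    # reasoning start tag, as named in the figure captions
THINK_CLOSE = "</think>"  # reasoning terminator token

# Hypothetical post-regulation instruction; the source does not give its wording.
REGULATION_HINT = "Give the final answer on one line as 'Answer: <value>'."


def truncate_reasoning(reasoning: str, max_words: int) -> str:
    """Keep at most max_words whitespace-separated words of the reasoning.

    Words stand in for model tokens here, which is a simplification.
    """
    words = reasoning.split()
    return " ".join(words[:max_words])


def build_early_terminated_prompt(question: str, reasoning: str, max_words: int) -> str:
    """Close the reasoning span early, then append the regulation hint.

    Placing the hint after the terminator is an assumption of this sketch.
    """
    kept = truncate_reasoning(reasoning, max_words)
    return (
        f"{question}\n"
        f"{THINK_OPEN}{kept}{THINK_CLOSE}\n"
        f"{REGULATION_HINT}\n"
    )


# max_words=0 inserts the terminator immediately, skipping all reasoning.
prompt = build_early_terminated_prompt(
    question="What is 17 + 25?",
    reasoning="First add the tens to get 30, then add the units to get 12.",
    max_words=0,
)
print(prompt)
```

In a real pipeline, a string like this would be passed to the model as the prefix from which it continues decoding the answer. Larger values of `max_words` correspond to inserting the terminator at later positions.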

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: GPQA [rein2024gpqa] accuracy of DeepSeek-R1-Distill-LLaMA-8B [guo2025deepseek] under varying token budgets. Red: ThinkLess (compressed reasoning); Blue: full CoT reasoning. The left part of the legend shows how marker size maps to latency, the middle part identifies each method, and the right part gives each method's maximum accuracy and the corresponding latency.
  • Figure 2: Attention heatmaps across different layers of DeepSeek-R1-Distill-LLaMA-8B on a GSM8K sample [cobbe2021training]. Tokens within the <think>...</think> span receive uniform attention in early layers, but deeper layers gradually shift focus to the boundary tokens, indicating information migration and compression of reasoning content. Similar observations hold for other models and datasets.
  • Figure 3: We insert a </think> token every 16 tokens in DeepSeek-R1-Distill-Qwen-7B and extract last-layer hidden states. These states are highly similar (0.9) across segments, showing that reasoning adds little new information. The final state is also similar to earlier ones, indicating early convergence and redundancy in later reasoning. Similar observations hold across other models and datasets. Best viewed zoomed in.
  • Figure 4: Accuracy of DeepSeek-R1-Distill-Qwen-7B vs. the position where </think> is inserted. The benchmark is the BBH dataset [suzgun2022challenging].
  • Figure 5: Top@$k$ accuracy of ThinkLess vs. Top@$1$ accuracy of DeepSeek-distilled models across datasets and models. We set $k = \frac{\text{Token Budget}}{512}$ so that token usage matches that of the distilled models. Legends follow Figure 1.
  • ...and 1 more figure
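
The budget matching in the Figure 5 caption is simple arithmetic, and the sketch below works through it. The caption gives only $k = \frac{\text{Token Budget}}{512}$. The integer floor, the minimum of one sample, the any-of-$k$ reading of Top@$k$, the function names, and the toy data are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def samples_for_budget(token_budget: int, tokens_per_sample: int = 512) -> int:
    """Return k = token_budget / 512, floored and at least 1.

    The division by 512 comes from the Figure 5 caption; the floor and the
    minimum of one sample are assumptions of this sketch.
    """
    if token_budget <= 0 or tokens_per_sample <= 0:
        raise ValueError("token_budget and tokens_per_sample must be positive")
    return max(1, token_budget // tokens_per_sample)


def top_at_k_accuracy(samples_correct: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of questions with a correct answer among the first k samples.

    Counting a question as solved when any of its k samples is correct is an
    assumed definition of Top@k.
    """
    if not samples_correct:
        raise ValueError("need at least one question")
    hits = sum(any(flags[:k]) for flags in samples_correct)
    return hits / len(samples_correct)


# Toy correctness flags: 3 questions, 4 sampled answers each (made-up data).
toy = [
    [False, True, False, False],
    [False, False, False, False],
    [True, False, True, False],
]

k = samples_for_budget(2048)  # a 2048-token budget allows 4 short samples
print(k, top_at_k_accuracy(toy, k), top_at_k_accuracy(toy, 1))
```

Under these assumptions, a baseline that spends its whole budget on one long chain of thought is compared against $k$ short ThinkLess samples that together consume roughly the same number of tokens.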