TokenSkip: Controllable Chain-of-Thought Compression in LLMs

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, Wenjie Li

TL;DR

TokenSkip is a controllable CoT compression method that prunes semantically less important reasoning tokens and teaches the model shortcuts between critical steps. It scores token importance with a lightweight compressor and fine-tunes the model via LoRA on CoT data compressed at mixed ratios. This yields substantial reductions in CoT length (e.g., 40% on GSM8K with less than a 0.4% accuracy drop for Qwen2.5-14B-Instruct) and notable speedups, while preserving reasoning quality across multiple models and math benchmarks. The approach combines mixed-ratio training data, a pruning mechanism governed by the compression ratio γ, and inference with a tunable ratio; the analyses cover ratio adherence, importance distributions, and budgeted-length effects. Overall, TokenSkip offers a practical, low-cost path to efficient CoT reasoning in LLMs that generalizes across models and tasks.

Abstract

Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks demonstrate the effectiveness of TokenSkip in reducing CoT token usage while preserving strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct, TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less than a 0.4% performance drop. We release our code and checkpoints at https://github.com/hemingkx/TokenSkip.
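
The token-skipping idea in the abstract can be illustrated with a minimal sketch: given one importance score per CoT token, keep roughly a fraction γ of the tokens (the most important ones) in their original order. The function name `compress_cot`, the rounding rule, and the hand-written scores are assumptions made for illustration; in TokenSkip the scores come from a learned compressor, and this is not the authors' implementation.

```python
from typing import List, Sequence


def compress_cot(
    tokens: Sequence[str], importance: Sequence[float], gamma: float
) -> List[str]:
    """Keep about a `gamma` fraction of tokens, chosen by highest importance.

    The surviving tokens are returned in their original order.
    Hypothetical helper for illustration only.
    """
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    if not tokens:
        return []

    # Number of tokens to retain (at least one).
    keep = max(1, round(gamma * len(tokens)))

    # Rank positions by descending importance; earlier position wins ties.
    ranked = sorted(range(len(tokens)), key=lambda i: (-importance[i], i))

    # Restore the original order among the retained positions.
    kept_positions = sorted(ranked[:keep])
    return [tokens[i] for i in kept_positions]


# Toy example with made-up importance scores.
tokens = ["First", ",", "add", "2", "and", "3", "to", "get", "5", "."]
importance = [0.1, 0.05, 0.6, 0.9, 0.3, 0.9, 0.2, 0.4, 0.95, 0.1]
print(compress_cot(tokens, importance, gamma=0.5))
# ['add', '2', '3', 'get', '5']
```

With γ = 1.0 the sequence is returned unchanged, and smaller values of γ drop connective tokens first because they carry the lowest scores in this toy example.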

Paper Structure (33 sections, 7 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: In contrast to vanilla CoT that generates all reasoning tokens sequentially, TokenSkip enables LLMs to skip tokens with less semantic importance and learn shortcuts between critical reasoning tokens, facilitating controllable CoT compression.
  • Figure 2: Visualization of token importance within a CoT sequence, with darker colors indicating higher values. This figure compares two token importance measurements: Selective Context and LLMLingua-2.
  • Figure 3: Recovering the compressed CoT for GSM8K math word problem using LLaMA-3.1-8B-Instruct.
  • Figure 4: Illustration of TokenSkip. During training, TokenSkip first generates CoT trajectories from the target LLM. These CoTs are then compressed to various ratios sampled from the ratio set. TokenSkip fine-tunes the LLM using compressed CoTs with mixed ratios, enabling controllable CoT inference at any desired $\gamma \in \left\{\gamma_0,\dots,\gamma_z\right\}$.
  • Figure 5: Compression performance of TokenSkip on Qwen2.5-Instruct models. Qwen2.5-14B-Instruct shows almost no performance drop with 40% token trimming.
  • ...and 7 more figures
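
The training-data step described in the Figure 4 caption (compress each CoT at a ratio sampled from a ratio set, then fine-tune on the mixture) might be sketched as follows. The record fields (`question`, `cot`, `answer`, `gamma`), the function name, and the pluggable `compress` callable are hypothetical; the sketch stops before fine-tuning and makes no claim about how the authors format prompts.

```python
import random
from typing import Callable, Dict, List, Sequence

# Any function mapping (cot_tokens, gamma) -> compressed tokens.
Compressor = Callable[[List[str], float], List[str]]


def build_mixed_ratio_examples(
    samples: Sequence[Dict],
    ratios: Sequence[float],
    compress: Compressor,
    seed: int = 0,
) -> List[Dict]:
    """Attach a sampled ratio to each sample and compress its CoT to match.

    Hypothetical data-preparation helper for illustration only.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    examples = []
    for sample in samples:
        gamma = rng.choice(list(ratios))
        # A ratio of 1.0 means the original, uncompressed CoT is kept.
        cot = list(sample["cot"]) if gamma == 1.0 else compress(list(sample["cot"]), gamma)
        examples.append(
            {
                "question": sample["question"],
                "gamma": gamma,
                "cot": cot,
                "answer": sample["answer"],
            }
        )
    return examples


def truncate(tokens: List[str], gamma: float) -> List[str]:
    """Stand-in compressor (assumption): keeps a prefix, not an importance-based pruner."""
    return tokens[: max(1, round(gamma * len(tokens)))]


samples = [
    {"question": "2 + 3 = ?", "cot": ["add", "2", "and", "3", "gives", "5"], "answer": "5"},
    {"question": "4 * 2 = ?", "cot": ["multiply", "4", "by", "2", "gives", "8"], "answer": "8"},
]
for example in build_mixed_ratio_examples(samples, [0.5, 1.0], truncate):
    print(example["gamma"], example["cot"])
```

Passing the compressor as an argument keeps the data-mixing logic independent of how token importance is measured, so an importance-based pruner can replace the prefix stand-in without other changes.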