Pruning the Unsurprising: Efficient LLM Reasoning via First-Token Surprisal

Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu

TL;DR

ASAP targets the inefficiency of large reasoning models by pruning long chain-of-thought (CoT) traces. It shows that reasoning information is concentrated at the start of each step and exploits this with anchor-guided pruning followed by first-token surprisal refinement, producing compact, logic-dense CoTs. The approach reaches a state-of-the-art accuracy-efficiency Pareto frontier, delivering higher accuracy with substantially fewer generated tokens and lower latency on code and math benchmarks, and it transfers across model architectures. By using information-theoretic signals to guide pruning, the work makes efficient, coherent inference with large reasoning models more practical to deploy.

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces pose substantial challenges for training cost and inference latency. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps because of the dilution of logical information. In this paper, we propose ASAP (Anchor-guided, SurprisAl-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. Leveraging the insight that logical branching choices are concentrated at the onset of reasoning steps, it then enables logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP distills the models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning. Experiments show that ASAP achieves state-of-the-art accuracy across multiple benchmarks while substantially reducing training and inference costs.

Paper Structure

This paper contains 42 sections, 3 equations, 4 figures, 19 tables, and 2 algorithms.

Figures (4)

  • Figure 1: Illustration of CoT pruning by ASAP. The Original CoT generated by LRMs exhibits two types of redundancy: (1) Structural Redundancy, such as digressive branches (highlighted in red dashed boxes), which are removed by our Stage 1 Anchor-guided pruning; and (2) Logical Redundancy within valid paths. ASAP addresses the latter in Stage 2 by computing the surprisal of the first tokens of reasoning steps (marked in blue) to identify and retain only the critical cognitive pivots.
  • Figure 2: Empirical analysis of 10M tokens from DeepSeek-R1-Distill-Qwen-32B. (a) The entropy distribution reveals a clear information concentration: first tokens (blue) exhibit significantly higher uncertainty (entropy) compared to body tokens (orange), which are highly deterministic. (b) The most frequent first tokens are a mixture of logical operators (e.g., Wait) and ubiquitous syntactic connectors (e.g., So). (c) High-entropy states filter out predictable fillers like So or Then, while exclusively highlighting cognitive pivots such as Perhaps, What, and Alternative.
  • Figure 3: The overall framework of ASAP. The pipeline consists of three phases: (1) In Stage 1, the LLM generates a "Direct Thought" ($\mathcal{P}$) from the (Question, Answer) pair. $\mathcal{P}$ acts as an anchor to prune the "Original CoT" ($C$) into a "Coarse-grained Pruned CoT" ($C_{coarse}$). (2) In Stage 2, we compute the First-Token Surprisal for each step in $C_{coarse}$. High-surprisal steps are retained, while low-surprisal fillers are pruned, yielding the final "Fine-grained Pruned CoT" ($C'$). (3) In the Training Stage, the data with ASAP-pruned CoTs is used to fine-tune the LRM for efficient inference.
  • Figure 4: Performance of ASAP on LiveCodeBench v4_v5 under different token budgets.
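
The Stage 2 idea described in Figures 2 and 3 can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the probability of each step's first token has already been obtained from some scoring language model, and it uses the standard definition of surprisal, -log p. The names `Step`, `first_token_prob`, and `prune_by_first_token_surprisal`, as well as the `keep_ratio` selection rule, are hypothetical and chosen only for illustration; the paper's actual selection criterion may differ.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Step:
    """One reasoning step of a coarse-grained pruned CoT (hypothetical container)."""
    text: str
    # Probability of the step's first token given the preceding context.
    # Assumed to be precomputed by a scoring language model.
    first_token_prob: float


def surprisal(prob: float) -> float:
    """Standard surprisal in nats: -log p. Rare (surprising) tokens score high."""
    return -math.log(prob)


def entropy(dist: Sequence[float]) -> float:
    """Shannon entropy in nats of a next-token distribution (the quantity in Figure 2a)."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def prune_by_first_token_surprisal(steps: List[Step], keep_ratio: float) -> List[Step]:
    """Keep the highest-surprisal steps and preserve their original order.

    `keep_ratio` is an illustrative assumption, not a parameter from the paper.
    """
    n_keep = max(1, math.ceil(len(steps) * keep_ratio))
    # Rank step indices by first-token surprisal, most surprising first.
    ranked = sorted(
        range(len(steps)),
        key=lambda i: surprisal(steps[i].first_token_prob),
        reverse=True,
    )
    # Restore document order so the pruned CoT still reads top to bottom.
    kept_indices = sorted(ranked[:n_keep])
    return [steps[i] for i in kept_indices]


if __name__ == "__main__":
    # Made-up probabilities: fillers such as "So" and "Then" are predictable,
    # pivots such as "Wait" and "Alternatively" are not.
    demo = [
        Step("So x = 2.", 0.90),
        Step("Wait, check the sign.", 0.05),
        Step("Then y = 4.", 0.80),
        Step("Alternatively, factor the expression.", 0.10),
    ]
    for step in prune_by_first_token_surprisal(demo, keep_ratio=0.5):
        print(step.text)
```

With the made-up probabilities in the demo, the predictable "So" and "Then" steps receive low surprisal and are dropped, while the "Wait" and "Alternatively" steps are kept. Sorting the kept indices back into document order is a design choice that keeps the pruned trace readable as a sequence.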