Pruning the Unsurprising: Efficient LLM Reasoning via First-Token Surprisal
Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu
TL;DR
ASAP targets the inefficiency of large reasoning models by pruning long chain-of-thought (CoT) traces. It builds on the observation that reasoning information is concentrated at the start of each step, and exploits this through anchor-guided pruning followed by first-token surprisal refinement to produce compact, logic-dense CoTs. The approach achieves a state-of-the-art accuracy-efficiency Pareto frontier, delivering higher accuracy with substantially fewer generated tokens and lower latency on code and math benchmarks, and it remains effective when transferred across model architectures. By using information-theoretic signals to decide which reasoning steps to keep, it makes efficient, coherent inference with large reasoning models more practical to deploy.
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces pose substantial challenges for training cost and inference latency. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps because of the dilution of logical information. In this paper, we propose ASAP (Anchor-guided, SurprisAl-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. Leveraging the insight that logical branching choices are concentrated at the onset of reasoning steps, it then enables logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP distills the models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning. Experiments show that ASAP achieves state-of-the-art accuracy across multiple benchmarks while substantially reducing training and inference costs.
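To make the first-token surprisal idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes surprisal takes its standard form, -log p(first token of a step | preceding context), and that steps are kept greedily in order of decreasing surprisal until a token budget is exhausted. The abstract does not specify the selection rule, so the greedy budget is an assumption, as are the names `Step`, `first_token_surprisal`, `prune_steps`, and `token_budget`. In practice the probabilities would come from a language model; here they are hard-coded illustrative values.

```python
import math
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    first_token_prob: float  # assumed: p(first token of step | context), in (0, 1]
    num_tokens: int          # length of the step in tokens


def first_token_surprisal(prob: float) -> float:
    """Standard surprisal in nats: -log p."""
    return -math.log(prob)


def prune_steps(steps, token_budget):
    """Greedy sketch (an assumption, not the paper's exact rule):
    keep the highest-surprisal steps that fit in token_budget,
    then return them in their original order."""
    ranked = sorted(
        range(len(steps)),
        key=lambda i: first_token_surprisal(steps[i].first_token_prob),
        reverse=True,
    )
    kept, used = set(), 0
    for i in ranked:
        if used + steps[i].num_tokens <= token_budget:
            kept.add(i)
            used += steps[i].num_tokens
    return [steps[i] for i in sorted(kept)]


# Illustrative, made-up values: low first-token probability means high surprisal.
example_steps = [
    Step("Let n be the input size.", 0.90, 7),
    Step("Wait, the loop bound is off by one.", 0.05, 9),
    Step("So we iterate n times.", 0.80, 6),
    Step("Alternatively, use a prefix sum.", 0.10, 7),
]
kept_steps = prune_steps(example_steps, token_budget=16)
for step in kept_steps:
    print(step.text)
```

With these made-up numbers, the two steps that open with an unexpected token ("Wait", "Alternatively") are retained, while the predictable continuation steps are dropped. This mirrors the intuition that logical branching choices show up at the onset of a step.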
