Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models
Yang Xiang, Yixin Ji, Juntao Li, Min Zhang
TL;DR
This work addresses the efficiency challenge of large reasoning models (LRMs) by studying pruning and the calibration data it relies on. It shows that directly applying LLM pruning methods to LRMs degrades reasoning performance, and proposes Selective Self-Generated Reasoning (SSGR) calibration data to preserve reasoning ability during pruning. Through systematic experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, the authors demonstrate that SSGR, especially with moderately long, difficult self-generated reasoning samples, consistently outperforms traditional calibration data, achieving improvements of up to roughly 11-14% over baseline calibration and better cross-domain generalization to GPQA. The results highlight the practical impact of data-driven calibration for pruning LRMs and suggest that tuning the difficulty and length of calibration data is crucial for maintaining reasoning capabilities after pruning.
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks, but their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. Existing efforts, however, have focused primarily on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes, and our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10% to 13% compared to general pruning methods.
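The selection idea described in the abstract (keep self-generated reasoning traces that are challenging and moderately long) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `ReasoningSample` fields, the `difficulty` score, the function name `select_calibration_samples`, and every threshold value are hypothetical placeholders chosen for illustration, since the abstract does not specify how difficulty is measured or which length range counts as moderate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningSample:
    """One self-generated reasoning trace (all fields are illustrative)."""
    prompt: str
    response: str      # chain-of-thought produced by the model being pruned
    difficulty: float  # hypothetical score in [0, 1]; higher means harder
    num_tokens: int    # length of the response in tokens


def select_calibration_samples(
    samples: List[ReasoningSample],
    min_difficulty: float = 0.5,  # placeholder threshold, not from the paper
    min_tokens: int = 1024,       # placeholder lower bound on length
    max_tokens: int = 4096,       # placeholder upper bound on length
    budget: int = 128,            # placeholder number of samples to keep
) -> List[ReasoningSample]:
    """Keep hard, moderately long samples, hardest first, up to `budget`."""
    kept = [
        s for s in samples
        if s.difficulty >= min_difficulty
        and min_tokens <= s.num_tokens <= max_tokens
    ]
    # Stable sort: samples with equal difficulty keep their input order.
    kept.sort(key=lambda s: s.difficulty, reverse=True)
    return kept[:budget]


if __name__ == "__main__":
    pool = [
        ReasoningSample("p1", "r1", 0.9, 2000),
        ReasoningSample("p2", "r2", 0.2, 2000),  # filtered out: too easy
        ReasoningSample("p3", "r3", 0.8, 100),   # filtered out: too short
        ReasoningSample("p4", "r4", 0.7, 9000),  # filtered out: too long
        ReasoningSample("p5", "r5", 0.6, 3000),
    ]
    for sample in select_calibration_samples(pool, budget=2):
        print(sample.prompt, sample.difficulty, sample.num_tokens)
```

The selected samples would then serve as calibration inputs to whatever pruning method is being applied. The filter-then-rank structure is only one plausible way to combine the two criteria named in the abstract.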
