Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models

Yang Xiang, Yixin Ji, Juntao Li, Min Zhang

TL;DR

This work addresses the efficiency challenge of large reasoning models (LRMs) by studying pruning and its calibration data. It shows that directly applying LLM pruning methods to LRMs degrades reasoning performance, and proposes Selective Self-Generated Reasoning (SSGR) calibration data to preserve reasoning ability during pruning. Through systematic experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, the authors demonstrate that SSGR, especially when built from moderately long, difficult self-generated reasoning samples, consistently outperforms traditional calibration data, with improvements of up to roughly $11$–$14\%$ over baseline calibration and better cross-domain generalization to GPQA. The results highlight the practical impact of data-driven calibration for pruning LRMs and suggest that tuning the difficulty and length of calibration data is crucial for maintaining reasoning capabilities after pruning.

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. However, existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.
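
The selection idea in the abstract (keep challenging, moderately long self-generated reasoning data) can be illustrated with a minimal sketch. This is not the authors' implementation. The `ReasoningSample` fields, the difficulty labels, the token bounds, and the sample count `k` are all hypothetical placeholders chosen for illustration, since the abstract does not specify how difficulty or "moderately long" is measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningSample:
    problem: str
    response: str     # reasoning trace generated by the dense model itself
    difficulty: str   # hypothetical label: "easy", "medium", or "hard"
    num_tokens: int   # length of the response in tokens


def select_calibration_samples(
    samples: List[ReasoningSample],
    difficulty: str = "hard",   # assumption: keep only the hardest bucket
    min_tokens: int = 2000,     # assumption: lower bound of a "moderate" length
    max_tokens: int = 8000,     # assumption: upper bound of a "moderate" length
    k: int = 128,               # assumption: number of calibration samples
) -> List[ReasoningSample]:
    """Keep samples of the requested difficulty whose length falls in a moderate band."""
    kept = [
        s for s in samples
        if s.difficulty == difficulty and min_tokens <= s.num_tokens <= max_tokens
    ]
    # Preserve input order and return at most k samples.
    return kept[:k]
```

The selected responses would then stand in for generic text such as C4 as the calibration set passed to a pruning method. Any real use would need the paper's actual difficulty criterion and length range in place of the placeholder defaults above.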

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Performance of the pruned DeepSeek-R1-Distill-Qwen-7B model using SparseGPT with different calibration data (C4, MathR, and SSGR) across five reasoning benchmarks: GSM8K, MATH500, OlympiadBench, AIME, and GPQA Diamond.
  • Figure 2: Comparison of pruning performance using three different types of calibration data for the dense model. For pre-training data (C4, Wikipedia, and DCLM), we report the highest accuracy among the three on each benchmark.
  • Figure 3: Comparison of Pruning Performance Using Calibration Data of Varying Difficulty Levels (Easy, Medium, and Hard).
  • Figure 4: Pruning performance under various pruning settings with different calibration data.
  • Figure 5: Average response length of pruned LRMs across different benchmarks.