Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang

TL;DR

Reasoning LLMs deliver strong multi-step inference but are costly to deploy; standard pruning degrades their reasoning. The authors identify calibration-data misalignment and decode-time reasoning as core issues and propose RESP, a self-reflective structured pruning framework using self-generated calibration traces, decode-only gradient-based importance, and progressive calibration regeneration. Empirical results on Qwen3-8B across GSM8K and MathQA show RESP preserves near-dense accuracy at 20-30% sparsity and significantly mitigates performance collapse at higher sparsity, outperforming strong baselines. This work enables practical, hardware-friendly pruning of reasoning-focused LLMs for resource-constrained settings.

Abstract

Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
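The abstract names decode-only gradient-based importance estimation without giving details, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "decode-only" means the calibration loss is averaged over generated (decode) tokens and not over prompt tokens, and it uses a generic first-order Taylor score, the sum of |weight × gradient| per output channel, as the importance criterion. All function names are hypothetical, and the gradients are supplied as plain lists rather than computed by a real model.

```python
from typing import List


def decode_only_mask(prompt_len: int, total_len: int) -> List[int]:
    """Return 1 for generated (decode) positions and 0 for prompt positions.

    Assumption: the prompt occupies the first `prompt_len` positions.
    """
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must lie in [0, total_len]")
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def masked_mean_loss(token_losses: List[float], mask: List[int]) -> float:
    """Average per-token losses over the masked-in (decode) positions only."""
    if len(token_losses) != len(mask):
        raise ValueError("token_losses and mask must have equal length")
    kept = [loss for loss, m in zip(token_losses, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    return sum(kept) / len(kept)


def channel_importance(weights: List[List[float]],
                       grads: List[List[float]]) -> List[float]:
    """Generic first-order Taylor score per output channel (one row each).

    Assumption: this is a stand-in criterion, not the paper's exact score.
    """
    return [
        sum(abs(w * g) for w, g in zip(w_row, g_row))
        for w_row, g_row in zip(weights, grads)
    ]


def channels_to_prune(scores: List[float], sparsity: float) -> List[int]:
    """Return indices of the lowest-scoring channels for a target sparsity."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    k = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])
```

In a real pipeline the per-token losses and gradients would come from a backward pass over self-generated traces; the sketch only shows how the mask restricts the loss to decode positions and how whole channels, rather than individual weights, are ranked for removal.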

Paper Structure

This paper contains 12 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of RESP. (a) Conventional post-training methods use generic corpora for calibration, causing severe reasoning degradation. (b) RESP builds calibration data from task-specific prompts and self-generated traces to better match the RLM's decoding behavior, which improves performance. (c) With regeneration, RESP further boosts performance under high sparsity.
  • Figure 2: Calibration source matters for pruning reasoning LLMs. (a) Observation 1: calibrating on a generic corpus (C4) causes severe degradation, while task-specific (GSM8K) calibration preserves accuracy. (b) Observation 2: the gradient-based method (GISP) benefits more from task-specific calibration than OBS-based methods (Others).
  • Figure 3: Illustration of (a) Observation 3: calibration with self-generated reasoning traces outperforms task-specific calibration. Furthermore, (b) the pruned model still suffers a significant performance drop at high sparsity, which motivates progressive regeneration. "Self-G" denotes self-generated calibration data.