Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

Ziyan Wang; Enmao Diao; Qi Le; Pu Wang; Guanchu Wang; Minwoo Lee; Shu-ping Yeh; Li Yang

TL;DR

Reasoning LLMs deliver strong multi-step inference but are costly to deploy; standard pruning degrades their reasoning. The authors identify calibration-data misalignment and decode-time reasoning as core issues and propose RESP, a self-reflective structured pruning framework using self-generated calibration traces, decode-only gradient-based importance, and progressive calibration regeneration. Empirical results on Qwen3-8B across GSM8K and MathQA show RESP preserves near-dense accuracy at 20-30% sparsity and significantly mitigates performance collapse at higher sparsity, outperforming strong baselines. This work enables practical, hardware-friendly pruning of reasoning-focused LLMs for resource-constrained settings.

Abstract

Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. Pruning offers a promising way to reduce compute and memory costs by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs: even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
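
The three ingredients named in the abstract (self-generated calibration, decode-only importance, progressive regeneration) can be pictured with a minimal control-flow sketch. This is an illustration under assumptions, not the authors' implementation: the abstract does not specify the sparsity schedule, the importance formula, or any function names, so the even schedule and the `generate`, `score`, and `prune` callables below are hypothetical placeholders supplied by the caller.

```python
from typing import Any, Callable, List, Sequence


def decode_only_mask(prompt_len: int, total_len: int) -> List[int]:
    """Return 1 for generated (decode) positions and 0 for prompt positions."""
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def masked_mean_loss(token_losses: Sequence[float], mask: Sequence[int]) -> float:
    """Average per-token loss over decode positions only (prompt tokens ignored)."""
    kept = [loss for loss, m in zip(token_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


def sparsity_schedule(target: float, steps: int) -> List[float]:
    """Evenly spaced sparsity levels ending at the target (an assumed schedule)."""
    return [target * (i + 1) / steps for i in range(steps)]


def progressive_prune(
    model: Any,
    prompts: Sequence[str],
    target: float,
    steps: int,
    generate: Callable[[Any, str], Any],         # hypothetical: model writes its own trace
    score: Callable[[Any, List[Any]], Any],      # hypothetical: decode-only importance
    prune: Callable[[Any, Any, float], Any],     # hypothetical: structured removal
) -> Any:
    """Prune in stages, regenerating calibration traces from the current model."""
    for sparsity in sparsity_schedule(target, steps):
        # Calibration data comes from the (already partly pruned) model itself.
        traces = [generate(model, p) for p in prompts]
        importance = score(model, traces)
        model = prune(model, importance, sparsity)
    return model
```

In a real pipeline, `score` would backpropagate a loss such as `masked_mean_loss` through the model and aggregate gradient-based scores per structure (for example, per attention head or MLP channel); that step is left abstract here because the abstract gives no formula for it.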
