When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Nan Zhang, Eugene Kwek, Yusen Zhang, Ngoc-Hieu Nguyen, Prasenjit Mitra, Rui Zhang

TL;DR

This paper examines how compression (quantization, distillation, and pruning) affects the reasoning and knowledge capabilities of large reasoning models (LRMs). It combines performance benchmarking on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue) with mechanistic interpretability, using weight-level analyses to map how individual linear components contribute to reasoning. Three findings stand out: weight count affects knowledge memorization more strongly than reasoning; the final-layer MLP up projection of distilled LRMs is among the most important components; and current quantization methods overly compress final-layer modules and MLP gate projections, so protecting just 2% of all weights that are excessively compressed can raise average accuracy by 6.57%. These insights offer practical guidance for designing compression pipelines that preserve both knowledge and reasoning.

Abstract

Compression methods, including quantization, distillation, and pruning, improve the computational efficiency of large reasoning models (LRMs). However, existing studies either fail to sufficiently compare all three compression methods on LRMs or lack in-depth interpretation analysis. In this paper, we investigate how the reasoning capabilities of LRMs are compromised during compression, through performance benchmarking and mechanistic interpretation. To uncover the effects of compression on reasoning performance, we benchmark quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue). To precisely locate compression effects on model weights, we adapt difference of means and attribution patching techniques, focusing on the activation of every linear component in compressed LRMs, to interpret fine-grained causal relationships between weights and various reasoning capabilities. This fine-grained interpretation addresses a fundamental question of compression: which weights are the most important for reasoning? Overall, we find that dynamically quantized 2.51-bit R1 reaches close-to-R1 performance. With empirical verification, we present three main findings that generalize across both Llama and Qwen: (1) Weight count has a greater impact on LRMs' knowledge memorization than reasoning, highlighting the risks of pruning and distillation; (2) The MLP up projection in the final layer of distilled LRMs is one of the most important components, offering a new perspective on locating critical weights, a fundamental problem in model compression; and (3) Current quantization methods overly compress the final-layer modules and MLP gate projections, so protecting just 2% of all weights that are excessively compressed can raise average accuracy by 6.57%, greatly surpassing the state-of-the-art.
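
The abstract names two ingredients, difference of means and attribution patching, and the Figure 1 caption describes the resulting importance as a dot product of a steering vector and gradients with respect to activations. The sketch below illustrates that arithmetic on toy vectors only. It is not the authors' implementation: the function names, the toy activations, and the choice of scalar objective behind `grad` are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(with_behavior: List[Vector], without_behavior: List[Vector]) -> Vector:
    """Difference of means: mean activation on examples that show the
    behavior minus mean activation on examples that do not."""
    mu_pos = mean_vector(with_behavior)
    mu_neg = mean_vector(without_behavior)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def importance_score(steer: Vector, grad: Vector) -> float:
    """First-order (attribution-patching-style) estimate: dot product of the
    steering vector with the gradient of a scalar objective w.r.t. the
    same activations."""
    return sum(s * g for s, g in zip(steer, grad))


# Toy activations of one hypothetical linear module (3 hidden units).
acts_with = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]      # mean: [2.0, 2.0, 1.0]
acts_without = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]   # mean: [1.0, 1.0, 1.0]

steer = steering_vector(acts_with, acts_without)     # [1.0, 1.0, 0.0]
grad = [0.5, -0.25, 4.0]                             # assumed gradient values
score = importance_score(steer, grad)                # 0.5 - 0.25 + 0.0 = 0.25
print(steer, score)
```

In a real model the activations and gradients would come from forward and backward passes over each linear component at each layer; here they are hard-coded so the computation can be followed by hand.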

Paper Structure

This paper contains 31 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: An overview of our pipeline. Left: We benchmark compressed R1 variants on various reasoning tasks. Right: By computing weight importance towards a specific reasoning behavior (a dot product of the steering vector and gradients with respect to an LRM's activations), we study the compression effects on individual weights. We empirically verify our findings on weight importance by selectively quantizing or protecting a module to test its importance.
  • Figure 2: $\mathbf{I}_{m\ell}^c$ of DeepSeek-R1-Distill-Llama-8B (left) and change of $\mathbf{RI}_{m\ell}^c$ from DeepSeek-R1-Distill-Llama-8B to Llama-3.1-8B (right). Each heatmap displays scores of importance (or importance shift) of every module at each layer, providing a fine-grained analysis of weight contributions to the corresponding reasoning capability. On the right, increases in $\mathbf{RI}_{m\ell}^c$ are set to $0$, as they only offset decreases elsewhere, as discussed in \ref{sec:decoding_compression}. Every cluster of 4 side-by-side heatmaps (including those displayed below) follows the same scaling to show the precise magnitude of each weight module.
  • Figure 3: Change of $\mathbf{RI}_{m\ell}^c$ from DeepSeek-R1-Distill-Llama-8B to its 4-bit AWQ variant.
  • Figure 4: $\mathbf{I}_{m\ell}^c$ of DeepSeek-R1-Distill-Qwen-7B.
  • Figure 5: Change of $\mathbf{RI}_{m\ell}^c$ from DeepSeek-R1-Distill-Qwen-7B to Qwen2.5-Math-7B (the backbone model).
  • ...and 3 more figures
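
The Figure 1 caption mentions testing a module's importance by selectively quantizing or protecting it. A minimal sketch of that idea follows, assuming plain symmetric round-to-nearest quantization as a stand-in (the paper evaluates other methods, such as AWQ and dynamic quantization, which this does not reproduce). The module names, weight values, and helper functions are hypothetical.

```python
from typing import Dict, List, Set


def fake_quantize(weights: List[float], bits: int) -> List[float]:
    """Symmetric round-to-nearest quantization followed by dequantization.
    A simple stand-in, not the quantizer used in the paper."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1          # e.g. 7 positive levels for 4-bit
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def compress(modules: Dict[str, List[float]], bits: int,
             protected: Set[str]) -> Dict[str, List[float]]:
    """Quantize every module except those named in `protected`,
    which keep their original weights."""
    return {
        name: list(w) if name in protected else fake_quantize(w, bits)
        for name, w in modules.items()
    }


def protected_fraction(modules: Dict[str, List[float]], protected: Set[str]) -> float:
    """Share of all weights left uncompressed."""
    total = sum(len(w) for w in modules.values())
    kept = sum(len(w) for name, w in modules.items() if name in protected)
    return kept / total


# Hypothetical two-module "model"; names and values are illustrative only.
modules = {
    "layers.0.mlp.gate_proj": [0.10, -0.32, 0.70, 0.05],
    "layers.1.mlp.up_proj": [0.11, -0.29, 0.63, 0.02],
}
protected = {"layers.1.mlp.up_proj"}

compressed = compress(modules, bits=4, protected=protected)
print(compressed)
print(protected_fraction(modules, protected))
```

Comparing task accuracy with and without a given module in `protected` is one way to check whether an importance score identifies weights that actually matter; the toy model here protects half of its weights, far more than the 2% reported in the abstract.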