DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration

Tianteng Gu, Bei Liu, Bo Xiao, Ke Zeng, Jiacheng Liu, Yanmin Qian

TL;DR

The paper introduces DenoiseRotator, a plug-and-play framework that enhances pruning robustness for LLMs by redistributing parameter importance through learnable orthogonal transformations. By minimizing the information entropy of normalized importance scores, the method concentrates importance onto a smaller subset of weights, improving resilience to both unstructured and 2:4 semi-structured pruning. The approach is compatible with existing pruning techniques (Magnitude, Wanda, SparseGPT) and yields consistent improvements in perplexity and zero-shot accuracy across Mistral, LLaMA-3, and Qwen-2.5, with modest inference overhead and the potential for further efficiency via block-diagonal rotations. These results suggest entropy-guided importance reshaping as a principled strategy for robust, efficient sparsification of large language models.

Abstract

Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation, especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model's weight matrices. Our method can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Code is available at https://github.com/Axel-gu/DenoiseRotator.
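To make the entropy argument concrete, the following is a minimal sketch, not the authors' implementation. It compares two toy importance vectors with the same total: one spread evenly and one concentrated on a few entries. The function names (`importance_entropy`, `prune_2_of_4`, `retained_fraction`) and the numbers are illustrative assumptions. The 2:4 rule shown is the usual one: keep the two highest-scoring entries in every contiguous group of four.

```python
import math

def importance_entropy(scores):
    """Shannon entropy (in nats) of importance scores normalized to sum to 1."""
    total = sum(scores)
    probs = [s / total for s in scores]
    return -sum(p * math.log(p) for p in probs if p > 0)

def prune_2_of_4(scores):
    """Keep the 2 highest-scoring entries in each contiguous group of 4; zero the rest."""
    kept = []
    for start in range(0, len(scores), 4):
        group = scores[start:start + 4]
        # Indices of the two largest scores in this group (ties keep the earlier index).
        top2 = sorted(range(len(group)), key=lambda i: group[i], reverse=True)[:2]
        kept.extend(s if i in top2 else 0.0 for i, s in enumerate(group))
    return kept

def retained_fraction(scores):
    """Fraction of total importance that survives 2:4 pruning."""
    return sum(prune_2_of_4(scores)) / sum(scores)

# Toy data (assumed for illustration): both vectors carry the same total importance.
spread = [1.0] * 8
concentrated = [3.6, 0.2, 3.6, 0.2, 0.1, 0.1, 0.1, 0.1]

for name, scores in [("spread", spread), ("concentrated", concentrated)]:
    print(name, "entropy:", round(importance_entropy(scores), 3),
          "retained:", round(retained_fraction(scores), 3))
```

In this toy setting the concentrated vector has lower entropy and loses a smaller share of its total importance under the same 2:4 mask, which is the intuition behind using entropy as the objective.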

Paper Structure

This paper contains 23 sections, 12 equations, 3 figures, 25 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the DenoiseRotator framework. The top row illustrates a Transformer layer architecture (used in mainstream models such as LLaMA, Mistral, and Qwen) that consists of RMSNorm, attention, and feed-forward blocks. In the middle, learnable orthogonal matrices are inserted to rotate the weight matrices, concentrating parameter importance before pruning. The rotated weights are then merged and pruned in the bottom row. In this illustration, linear layers are represented in the $Y=XW$ format.
  • Figure 2: Visualization of the OBD importance (Eq. \ref{eq:obd}) for the output projection in the first layer of LLaMA-3-8B before and after orthogonal rotation. (a) and (b) show the 3D heatmaps of importance scores of the weight matrix before and after applying DenoiseRotator, respectively. (c) and (d) display the corresponding importance distributions, highlighting parameters before pruning (in blue) and after pruning (in orange) by the Wanda method. After rotation, importance becomes more concentrated.
  • Figure 3: Illustration of the QR decomposition reparameterization process. The diagram shows how an unconstrained matrix $A$ is optimized indirectly to achieve constrained optimization of the orthogonal matrix $Q$, while interfacing with the loss function during forward and backward passes. The process leverages QR decomposition to preserve orthogonality and integrate seamlessly into gradient-based optimization methods.
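The rotation and QR reparameterization described in Figures 1 and 3 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's code: the helper names (`orthogonal_from_unconstrained`, `rotate_linear`), the shapes, and the choice to rotate on the input side of $Y=XW$ are all assumptions. It covers only the forward step. In an autodiff framework the gradient would flow through the decomposition back to $A$, which NumPy does not provide.

```python
import numpy as np

def orthogonal_from_unconstrained(A):
    """Map an unconstrained square matrix A to an orthogonal Q via QR decomposition."""
    Q, _ = np.linalg.qr(A)
    return Q

def rotate_linear(X, W, Q):
    """Rotate the input and the weight so that (X Q)(Q^T W) equals X W."""
    return X @ Q, Q.T @ W

rng = np.random.default_rng(0)
n_tokens, d_in, d_out = 5, 6, 4      # illustrative sizes

A = rng.standard_normal((d_in, d_in))  # free parameter an optimizer would update
Q = orthogonal_from_unconstrained(A)

X = rng.standard_normal((n_tokens, d_in))
W = rng.standard_normal((d_in, d_out))
X_rot, W_rot = rotate_linear(X, W, Q)

# Q stays orthogonal whatever values A takes, so no explicit constraint is needed.
print("Q^T Q == I:", np.allclose(Q.T @ Q, np.eye(d_in)))
# The layer output is unchanged before any pruning is applied to W_rot.
print("output preserved:", np.allclose(X @ W, X_rot @ W_rot))
```

Because the rotated layer computes the same output as the original, the rotation only changes how importance is distributed across the entries of the weight matrix, and pruning is then applied to the rotated weights.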