DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration
Tianteng Gu, Bei Liu, Bo Xiao, Ke Zeng, Jiacheng Liu, Yanmin Qian
TL;DR
The paper introduces DenoiseRotator, a plug-and-play framework that enhances pruning robustness for LLMs by redistributing parameter importance through learnable orthogonal transformations. By minimizing the information entropy of normalized importance scores, the method concentrates importance onto a smaller subset of weights, improving resilience to both unstructured and 2:4 semi-structured pruning. The approach is compatible with existing pruning techniques (Magnitude, Wanda, SparseGPT) and yields consistent improvements in perplexity and zero-shot accuracy across Mistral, LLaMA3, and Qwen2.5 models, with modest inference overhead and the potential for further efficiency via block-diagonal rotations. These results suggest entropy-guided importance reshaping as a principled strategy for robust, efficient sparsification of large language models.
Abstract
Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation, especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model's weight matrices. Our method can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Code is available at https://github.com/Axel-gu/DenoiseRotator.
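The core idea of the abstract can be illustrated with a two-weight toy example. The sketch below is not the authors' implementation. It assumes squared magnitude as the importance score and uses a hand-picked 45° rotation in place of a learned orthogonal matrix, and all function names are hypothetical. It shows that rotating the weights and the input by the same orthogonal matrix leaves the dense output unchanged while lowering the entropy of the normalized importance scores, so pruning the least important weight costs less.

```python
import math


def importance_entropy(scores):
    """Shannon entropy (in nats) of scores normalized to sum to 1."""
    total = sum(scores)
    probs = [s / total for s in scores]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rotate(vec, theta):
    """Compute vec @ R(theta) for a length-2 row vector, R = [[c, -s], [s, c]].

    The same formula gives R^T applied to a column vector, so it is used
    for both the weights (w R) and the input (R^T x).
    """
    c, s = math.cos(theta), math.sin(theta)
    return [vec[0] * c + vec[1] * s, -vec[0] * s + vec[1] * c]


def prune_smallest(vec):
    """Zero the single smallest-magnitude entry (50% sparsity for length 2)."""
    drop = min(range(len(vec)), key=lambda i: abs(vec[i]))
    return [0.0 if i == drop else v for i, v in enumerate(vec)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy layer: one output neuron with two equally important weights.
w = [1.0, 1.0]
x = [0.3, -1.2]
theta = math.pi / 4  # assumption: fixed angle chosen by hand, not learned

w_rot = rotate(w, theta)  # w R
x_rot = rotate(x, theta)  # R^T x, so (w R)(R^T x) = w x

# Assumed importance score: squared weight magnitude.
h_before = importance_entropy([v * v for v in w])
h_after = importance_entropy([v * v for v in w_rot])

dense_out = dot(w, x)
err_plain = abs(dot(prune_smallest(w), x) - dense_out)
err_rot = abs(dot(prune_smallest(w_rot), x_rot) - dense_out)

print(f"entropy before rotation: {h_before:.4f}, after: {h_after:.4f}")
print(f"pruning error without rotation: {err_plain:.4f}, with rotation: {err_rot:.4f}")
```

In this toy case the rotation moves essentially all of the importance onto one weight, so the pruned weight carries almost nothing. In a real model the orthogonal matrices would be optimized against the entropy objective rather than chosen by hand, and the importance scores would come from the underlying pruning method.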
