Data-free Weight Compress and Denoise for Large Language Models

Runyu Peng, Yunhua Zhou, Qipeng Guo, Yang Gao, Hang Yan, Xipeng Qiu, Dahua Lin

TL;DR

This work addresses the scalability bottlenecks of large language models by introducing Data-free Joint Rank-k Approximation, a matrix-decomposition based compression that operates without calibration data. By jointly compressing interconnected linear layers in Transformer blocks (notably $W_Q$, $W_K$, and FFN components $W_{gate}$, $W_{up}$), the method preserves the essential mapping space while achieving high sparsity with minimal performance loss. Theoretical grounding in rank-k approximation and subspace analysis supports a denoise hypothesis, predicting that low-intensity noisy components can be removed without compromising core functionality. Empirical results on OpenCompass zero-shot tasks across LLaMA-7B and related models show promising retention of performance at 10–20% prune, and watermark purification experiments illustrate potential denoising benefits, highlighting practical impact for robust, calibration-free compression of LLMs.

Abstract

Large Language Models (LLMs) are reshaping the research landscape in artificial intelligence: as model parameters scale up, remarkable capabilities emerge across various domains. However, scaling is constrained by GPU memory and computational speed. To address these constraints, various weight compression methods have emerged, such as pruning and quantization. Given the low-rank nature of weight matrices in language models, reducing weights through matrix decomposition holds significant potential. In this paper, drawing upon the intrinsic structure of LLMs, we propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. Notably, our method requires no additional corpus and remains orthogonal to pruning and quantization methods, so it can be used in conjunction with them. We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data. Additionally, we explore the fundamental properties of LLM weight matrices that have undergone Rank-k Approximation and conduct comprehensive experiments to elucidate our hypothesis.

Paper Structure (25 sections, 1 theorem, 15 equations, 6 figures, 2 tables)

This paper contains 25 sections, 1 theorem, 15 equations, 6 figures, 2 tables.

Key Result

Theorem 3.1

For any real matrix $A \in \mathbb{R}^{n \times m}$ with rank $r \leq \min(n,m)$, three matrices $U\in \mathbb{R}^{n \times r}$, $\Lambda\in\mathbb{R}^{r \times r}$ and $V\in \mathbb{R}^{r \times m}$ can be found, where: $U^TU=VV^T=\mathbb{I}_{r \times r}$, $\Lambda$ is a real diagonal square matri
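
The shapes and orthogonality conditions visible in the theorem statement match the standard compact singular value decomposition, and a small numerical check may make them concrete. The following is a minimal sketch under that assumption, not the authors' code: the helper names `compact_svd` and `rank_k_approx`, the tolerance, and the toy matrix are all illustrative. It also shows the usual rank-k truncation, whose Frobenius error equals the norm of the discarded singular values.

```python
import numpy as np

def compact_svd(A, tol=1e-10):
    """Return U (n x r), singular values s (r,), V (r x m), with r = numerical rank of A.

    V here plays the role of the theorem's r x m factor (NumPy calls it Vh).
    """
    U, s, V = np.linalg.svd(A, full_matrices=False)
    # Count singular values that are non-negligible relative to the largest one.
    r = int(np.sum(s > tol * s[0])) if s.size and s[0] > 0 else 0
    return U[:, :r], s[:r], V[:r, :]

def rank_k_approx(A, k):
    """Truncated SVD: keep only the k largest singular components of A."""
    U, s, V = compact_svd(A)
    k = min(k, s.size)
    return (U[:, :k] * s[:k]) @ V[:k, :]

rng = np.random.default_rng(0)
# Toy example (assumption): a 6 x 5 matrix of rank 3, built from two thin factors.
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))

U, s, V = compact_svd(A)
print(U.shape, s.shape, V.shape)            # factor shapes n x r, r, r x m
print(np.allclose(U.T @ U, np.eye(len(s)))) # U^T U = I
print(np.allclose(V @ V.T, np.eye(len(s)))) # V V^T = I
print(np.allclose((U * s) @ V, A))          # U diag(s) V reproduces A

A2 = rank_k_approx(A, 2)
# Frobenius error of the rank-2 truncation vs. the single discarded singular value.
print(np.linalg.norm(A - A2), s[2])
```

Storing the singular values as a vector rather than a diagonal matrix is only a convenience; `np.diag(s)` recovers the $r \times r$ diagonal factor.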

Figures (6)

  • Figure 1: Output analysis of a linear layer whose weights carry noise from training. $y^*$ is the output of the original weight matrix, and $y'=y^{*//U_{(k)}}$ is the output of the weight matrix after Rank-k Approximation. $y$ is the output of an ideal weight matrix with no noise in its low-intensity components. We posit that $\|y-y^*\|_2 \geq \|y-y'\|_2$ in most scenarios.
  • Figure 2: How Rank-k Approximation affects both the input feature space and the output feature space of the joint weight matrix of $W_Q$ and $W_K$. $V_{(k)}^T$ denotes both the reduced input analysis matrix and the subspace it describes. $x, q, k$ are the original activations, and $x', q', k'$ are the equivalent representations of the activations computed with the reduced matrix. The similarity of $q$ and $k$ can be approximated by that of $q'$ and $k'$.
  • Figure 3: Comparison of Joint Rank-k Approximation and separate Rank-k Approximation under different setups. We run the same ablation study on both LLaMA-7B and LLaMA2-7B to rule out coincidence. The prune ratio is the number of parameters as a fraction of the original matrix. Because SVD introduces extra parameters to represent the original matrix, the prune ratio is larger than 1 unless Rank-k Approximation is applied afterwards. For (a)(d), we apply Rank-$k$ Approximation to $W_Q$ and $W_K$ within each attention head. For (b)(e), we apply Rank-$k$ Approximation to $W_Q$ and $W_K$ regardless of the head division. For (c)(f), we apply Rank-$k$ Approximation to $W_{gate}$ and $W_{up}$.
  • Figure 4: Zero-shot accuracy on OpenbookQA of the Mistral-7B model with $W_Q$ and $W_K$ approximated. The prune ratio is defined as in Figure 3: the number of parameters as a fraction of the original matrix. A diamond mark indicates that the accuracy of the pruned model exceeds that of the unapproximated weights.
  • Figure 5: Perplexity of the given sentence under the fine-tuned model at different approximation ratios. Triangle marks indicate where the model fails to reliably generate the given sentence. An approximation ratio $r$ corresponds to a Joint Rank-$(r \times 4096)$ Approximation on $W_{gate}$ and $W_{up}$.
  • ...and 1 more figures
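
The captions for Figures 2 and 3 describe a joint factorization of $W_Q$ and $W_K$ with a shared input-side projection, and a prune ratio that exceeds 1 before truncation. The sketch below is one plausible reading, not the authors' implementation: it assumes the two matrices are stacked along the output dimension, factored with a single SVD, and truncated to inner size $k$. The names `joint_rank_k` and `param_ratio`, the `(out_features, in_features)` weight layout, and the parameter-counting convention are all assumptions made for illustration.

```python
import numpy as np

def joint_rank_k(W_q, W_k, k):
    """Factor the stacked matrix [W_q; W_k] as [B_q; B_k] @ P with inner size k.

    Assumption: weights are stored as (out_features, in_features), so q = W_q @ x.
    """
    n = W_q.shape[0]
    U, s, Vt = np.linalg.svd(np.vstack([W_q, W_k]), full_matrices=False)
    B = U[:, :k] * s[:k]   # (2n, k) output-side factor with singular values absorbed
    P = Vt[:k, :]          # (k, m) input projection shared by both matrices
    return B[:n], B[n:], P

def param_ratio(n, m, k):
    """Parameters of the factored pair relative to two dense n x m matrices."""
    return k * (2 * n + m) / (2 * n * m)

rng = np.random.default_rng(0)
n = m = 8
W_q = rng.standard_normal((n, m))
W_k = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# Without truncation (k = m) the factorization is exact but stores more parameters.
B_q, B_k, P = joint_rank_k(W_q, W_k, k=m)
x_low = P @ x                                   # shared reduced activation x'
q_err = np.linalg.norm(W_q @ x - B_q @ x_low)   # numerically zero
k_err = np.linalg.norm(W_k @ x - B_k @ x_low)   # numerically zero
full_ratio = param_ratio(n, m, m)               # 1.5: more parameters than dense

# With truncation the parameter count falls below the dense baseline.
Bq4, Bk4, P4 = joint_rank_k(W_q, W_k, k=4)
trunc_ratio = param_ratio(n, m, 4)              # 0.75
```

Under this counting, square $4096 \times 4096$ matrices with $k = 2048$ would also give a ratio of 0.75, while $k = 4096$ gives 1.5, which is consistent with the caption's remark that the ratio exceeds 1 before truncation. Whether the paper counts parameters in exactly this way is not stated in the captions.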

Theorems & Definitions (2)

  • Theorem 3.1
  • Definition 3.2