Data-free Weight Compress and Denoise for Large Language Models
Runyu Peng, Yunhua Zhou, Qipeng Guo, Yang Gao, Hang Yan, Xipeng Qiu, Dahua Lin
TL;DR
This work addresses the scalability bottlenecks of large language models by introducing Data-free Joint Rank-k Approximation, a matrix-decomposition-based compression method that requires no calibration data. By jointly compressing interconnected linear layers in Transformer blocks (notably $W_Q$ and $W_K$, and the FFN components $W_{gate}$ and $W_{up}$), the method preserves the essential mapping space while reducing parameters with minimal performance loss. Theoretical grounding in rank-k approximation and subspace analysis supports a denoising hypothesis: low-intensity, noisy components can be removed without compromising core functionality. Empirical results on OpenCompass zero-shot tasks with LLaMA-7B and related models show promising performance retention at 10–20% pruning, and watermark purification experiments illustrate potential denoising benefits, highlighting the practical value of robust, calibration-free compression of LLMs.
Abstract
Large Language Models (LLMs) are reshaping the research landscape in artificial intelligence, particularly as model parameters scale up significantly, unlocking remarkable capabilities across various domains. Nevertheless, the scaling of model parameters is constrained by limitations in GPU memory and computational speed. To address these constraints, various weight compression methods have emerged, such as pruning and quantization. Given the low-rank nature of weight matrices in language models, reducing weights through matrix decomposition holds significant promise. In this paper, drawing upon the intrinsic structure of LLMs, we propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. Notably, our method requires no additional corpus and remains orthogonal to pruning and quantization methods. We achieve a model pruning of 80% of the parameters while retaining 93.43% of the original performance without any calibration data. Additionally, we explore the fundamental properties of the weight matrices of LLMs that have undergone Rank-k Approximation and conduct comprehensive experiments to support our hypothesis.
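
As a rough illustration of the idea, and not the authors' implementation, a joint rank-k approximation of two weight matrices that consume the same input could be sketched with a truncated SVD of their concatenation. The function names, the $y = Wx$ convention, and the choice to stack the matrices along the output dimension are all assumptions made for this sketch; the abstract does not specify these details.

```python
import numpy as np


def rank_k_approx(W, k):
    """Best rank-k factorization W ~= A @ B in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # (d_out, k): left singular vectors scaled by singular values
    B = Vt[:k, :]          # (k, d_in): top-k right singular vectors
    return A, B


def joint_rank_k_approx(W1, W2, k):
    """Hypothetical joint variant: factor two matrices with one shared factor B.

    Assumes W1 and W2 have the same input dimension (columns), so stacking
    them along the output dimension gives a single matrix to decompose.
    """
    stacked = np.concatenate([W1, W2], axis=0)   # (d_out1 + d_out2, d_in)
    A, B = rank_k_approx(stacked, k)
    A1, A2 = A[: W1.shape[0]], A[W1.shape[0]:]   # split back per matrix
    return A1, A2, B


def param_counts(W1, W2, k):
    """Parameters before and after the joint factorization."""
    original = W1.size + W2.size
    compressed = k * (W1.shape[0] + W2.shape[0] + W1.shape[1])
    return original, compressed


# Toy example with random matrices; no data or calibration corpus is involved.
rng = np.random.default_rng(0)
W_q = rng.standard_normal((64, 64))
W_k = rng.standard_normal((64, 64))
A_q, A_k, B_shared = joint_rank_k_approx(W_q, W_k, k=16)
W_q_hat = A_q @ B_shared   # rank-16 stand-in for W_q
W_k_hat = A_k @ B_shared   # rank-16 stand-in for W_k
```

In this sketch the compression comes from sharing the factor `B_shared` between both matrices: the stored parameters drop from `W1.size + W2.size` to `k * (d_out1 + d_out2 + d_in)`, and only the weights themselves are needed to compute the factorization.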
