1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models

Zeliang Zong, Kai Zhang, Zheyang Li, Wenming Tan, Ye Ren, Yiyan Zhai, Jilin Hu

TL;DR

SSLC reduces the resource demands of large language models by decomposing each weight matrix $W$, in a data-aware manner, into a low-rank part $L$ and a sparse part $S$ that minimize the reconstruction loss $\|(W-L-S)X\|_F$ under a rank budget $r$ and a sparsity level $k\%$. The method alternates between sparsification and low-rank approximation (randomized SVD-based), preserves the top 1% most important weights, and recovers performance by fine-tuning the low-rank factors ($U$ and $V$) while keeping $S$ fixed. Empirical results on LLaMA and Qwen2.5 (7B–70B) show that SSLC achieves state-of-the-art compression without extra training, including 50% compression of Qwen2.5 with no performance loss and a speedup of at least $1.63\times$. The approach also supports fine-tuning with minimal parameter overhead and shows acceleration both on hardware simulators and in real-world throughput, which points to practical deployment benefits. Overall, SSLC offers a principled, data-driven way to deploy large language models more efficiently while preserving their capabilities.
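The alternating scheme summarized above can be illustrated with a small NumPy sketch. It is not the authors' implementation, and several choices are assumptions made for illustration: the function name `sslc_sketch`, an exact SVD in place of the randomized SVD, plain magnitude-based top-$k$ selection as the sparsification rule, and per-input-channel activation norms as the data-aware scaling (following the scaling by $\|X\|_2$ mentioned in the Figure 2 caption). The top-1% weight preservation and the fine-tuning stage are left out.

```python
import numpy as np

def sslc_sketch(W, X, rank, keep_ratio, steps=5):
    """Alternate a rank-`rank` SVD step and a top-k pruning step on a scaled W.

    W: (out_features, in_features) weights; X: (in_features, n_samples) activations.
    Returns U, V, S such that W is approximated by U @ V + S.
    """
    # Assumed scaling: L2 norm of each input channel's activations.
    d = np.linalg.norm(X, axis=1) + 1e-8
    Ws = W * d                       # scale each column of W by its channel norm
    S = np.zeros(Ws.shape)
    k = int(keep_ratio * Ws.size)    # number of entries the sparse part may keep

    for _ in range(steps):
        # Low-rank step: best rank-r fit to what the sparse part does not cover.
        # (Exact SVD here; the paper summary mentions a randomized SVD.)
        U, s, Vt = np.linalg.svd(Ws - S, full_matrices=False)
        U_r = U[:, :rank] * s[:rank]
        V_r = Vt[:rank]

        # Pruning step: keep the k largest-magnitude entries of the residual.
        R = Ws - U_r @ V_r
        S = np.zeros(R.shape)
        if k > 0:
            idx = np.argpartition(np.abs(R).ravel(), -k)[-k:]
            S.flat[idx] = R.flat[idx]

    # Undo the scaling so the factors apply to the original W.
    return U_r, V_r / d, S / d
```

Calling `sslc_sketch(W, X, rank=2, keep_ratio=0.3)` on small random matrices returns factors with shapes `(out, 2)`, `(2, in)`, and `(out, in)`. Each half-step exactly minimizes the scaled Frobenius error over its own block, so that proxy error does not increase across iterations.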

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce Synergistic Sparse and Low-Rank Compression (SSLC), a method for LLMs that leverages the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate low-rank approximation and sparse optimization as a unified problem and solve it with an iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models (7B-70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-art results. Notably, SSLC compresses Qwen2.5 by 50% with no performance drop and achieves at least 1.63$\times$ speedup, offering a practical solution for efficient LLM deployment.

Paper Structure

This paper contains 33 sections, 14 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Weight salience [huang2024slim] in LLaMA2-7B before and after synergistic low-rank approximation. Compared to panel (a), panel (b) not only shows a substantial reduction in extreme high values, but also reveals a decrease in prunable low values, thus mitigating the performance degradation caused by pruning.
  • Figure 2: The pipeline of our proposed SSLC method. First, the SVD step performs a low-rank approximation on the scaled matrix. Then, the pruning step converts the dense matrix into a sparse one. In essence, SSLC executes $T$ iterations of SVD and pruning on the scaled matrix, decomposing the original weight matrix $W$ into a sparse matrix $S_t$ and low-dimensional matrices $V_t$ and $U_t$. After the final iteration, the method multiplies $V_t$ and $S_t$ by the scaling matrix $\left \|X \right \| _2^{-1}$ to revert to the original matrix state before scaling.
  • Figure 3: Fine-tuning under different types of pruning. (a) introduces an additional LoRA parameter. In contrast, the low-dimensional matrix ($D_{low} \leq 128$) from the SSLC framework can be directly used for fine-tuning.
  • Figure 4: To retain 80% of the total salience, the pure pruning method must keep the top 42.3% of parameters, i.e., it removes 57.7% of them. In contrast, the synergistic method requires only the top 32.3% of parameters to form the sparse matrix, plus an additional 6.25% for the low-rank matrices. The overall reserved parameter ratio (38.6%) remains lower than that of the pure pruning method (42.3%), giving a compression "rate spread" of 3.7%.
  • Figure 5: Decomposition loss $\left \| (W-L_t-S_t)X\right \| _F$ for the down-projection matrices of different layers in LLaMA2-7B, shown as a percentage of the initial loss versus the number of iterations.
  • ...and 1 more figures
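
The parameter budget quoted in the Figure 4 caption can be checked with a few lines of arithmetic. The snippet below is only a sketch: `lowrank_param_ratio` is a hypothetical helper, and the $4096 \times 4096$ shape with rank 128 is an assumed example of a configuration that yields 6.25%, since the caption does not say which matrix shape or rank produces that figure.

```python
def lowrank_param_ratio(rows, cols, rank):
    """Fraction of a rows x cols matrix's parameter count used by rank-`rank` factors."""
    # Two factors of shapes (rows, rank) and (rank, cols).
    return rank * (rows + cols) / (rows * cols)

# Figures taken from the Figure 4 caption.
pure_pruning_kept = 0.423   # pure pruning keeps the top 42.3% of parameters
sparse_kept = 0.323         # synergistic method: sparse part keeps 32.3%
lowrank_kept = 0.0625       # synergistic method: low-rank part adds 6.25%

synergistic_total = sparse_kept + lowrank_kept        # 38.55%, quoted as 38.6%
rate_spread = pure_pruning_kept - synergistic_total   # 3.75 points unrounded

# Assumed example (not stated in the caption): a square 4096 x 4096 matrix
# factored at rank 128 uses 6.25% of the dense parameter count.
example_ratio = lowrank_param_ratio(4096, 4096, 128)
```

The caption's 3.7% follows from subtracting the rounded total (38.6%) from 42.3%; the unrounded difference is 3.75 percentage points.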