1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models
Zeliang Zong, Kai Zhang, Zheyang Li, Wenming Tan, Ye Ren, Yiyan Zhai, Jilin Hu
TL;DR
SSLC tackles the high resource demands of large language models by jointly compressing each weight matrix $W$ through a data-aware decomposition into a low-rank part $L$ and a sparse part $S$, minimizing the reconstruction loss $\|(W-L-S)X\|_F$ on input data $X$ under a rank budget $r$ and a sparsity level $k\%$. The method alternates between sparsification and low-rank approximation (based on randomized SVD), preserves the top 1% most important weights, and supports performance recovery by fine-tuning only the low-rank factors ($U$ and $V$) while keeping $S$ fixed, which adds minimal parameter overhead. Experiments on LLaMA and Qwen2.5 models (7B–70B) show that SSLC achieves state-of-the-art compression without extra training, including 50% compression of Qwen2.5 with no performance loss and a speedup of at least $1.63\times$. The reported acceleration holds both on hardware simulators and in real-world throughput measurements, making SSLC a principled, data-driven route to deploying large language models more efficiently while preserving their capabilities.
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce \underline{S}ynergistic \underline{S}parse and \underline{L}ow-Rank \underline{C}ompression (SSLC), a method for LLMs that leverages the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate low-rank approximation and sparse optimization as a unified problem and solve it with an iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models (7B--70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-art results. Notably, SSLC compresses Qwen2.5 by 50\% with no performance drop and achieves at least 1.63$\times$ speedup, offering a practical solution for efficient LLM deployment.
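To make the alternating scheme concrete, the following is a minimal NumPy sketch of a sparse plus low-rank decomposition under a data-aware loss. It is not the authors' implementation and rests on several simplifying assumptions: the loss $\|(W-L-S)X\|_F$ is replaced by a diagonal surrogate $\|(W-L-S)D\|_F$, where $D$ holds the per-feature norms of $X$; an exact truncated SVD stands in for the randomized SVD; and the top-1% weight preservation and fine-tuning steps are omitted. The names `sslc_sketch`, `truncated_svd`, `top_k_mask`, and the parameter `keep_frac` (the fraction of entries kept nonzero in $S$) are hypothetical.

```python
import numpy as np


def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M, returned as two factors."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]


def top_k_mask(scores, k):
    """Boolean mask selecting the k largest entries of `scores`."""
    mask = np.zeros(scores.size, dtype=bool)
    if k > 0:
        idx = np.argpartition(scores.ravel(), -k)[-k:]
        mask[idx] = True
    return mask.reshape(scores.shape)


def sslc_sketch(W, X, rank, keep_frac, n_iter=10, eps=1e-8):
    """Alternate low-rank and sparse updates on a diagonal surrogate loss.

    W: (out_features, in_features) weight matrix.
    X: (in_features, n_samples) input data.
    Returns U, V (low-rank factors, L = U @ V), S (sparse part), and the
    surrogate loss recorded after each iteration.
    """
    # Assumption: approximate the data-aware loss by per-feature input norms.
    d = np.linalg.norm(X, axis=1) + eps
    k = int(keep_frac * W.size)  # number of nonzeros allowed in S
    S = np.zeros_like(W)
    history = []
    for _ in range(n_iter):
        # Low-rank step: best rank-r fit of (W - S) in the scaled space,
        # then undo the column scaling on the right factor.
        A, B = truncated_svd((W - S) * d, rank)
        U, V = A, B / d
        L = U @ V
        # Sparse step: keep the residual entries with the largest
        # scaled magnitude, zero the rest.
        R = W - L
        S = np.where(top_k_mask(np.abs(R) * d, k), R, 0.0)
        history.append(np.linalg.norm((W - L - S) * d))
    return U, V, S, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 96))
    X = rng.standard_normal((96, 256))
    U, V, S, history = sslc_sketch(W, X, rank=8, keep_frac=0.25)
    print("surrogate loss per iteration:", np.round(history, 3))
```

Under this diagonal surrogate each step solves its sub-problem exactly, so the recorded loss cannot increase from one iteration to the next. Keeping `U` and `V` as separate factors mirrors the setup described above, in which only the low-rank factors would be fine-tuned while `S` stays fixed.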
