SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression

Xinhao Huang, You-Liang Huang, Zeyi Wen

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but the billion-scale parameters pose deployment challenges. Although existing methods attempt to reduce the scale of LLMs, they require either special hardware support or expensive post-training to maintain model quality. To facilitate efficient and affordable model slimming, we propose a novel training-free compression method for LLMs, named "SoLA", which leverages \textbf{So}ft activation sparsity and \textbf{L}ow-r\textbf{A}nk decomposition. SoLA can identify and retain a minority of components significantly contributing to inference, while compressing the majority through low-rank decomposition, based on our analysis of the activation pattern in the feed-forward network (FFN) of modern LLMs. To alleviate the decomposition loss, SoLA is equipped with an adaptive component-wise low-rank allocation strategy to assign appropriate truncation positions for different weight matrices. We conduct extensive experiments on LLaMA-2-7B/13B/70B and Mistral-7B models across a variety of benchmarks. SoLA exhibits remarkable improvement in both language modeling and downstream task accuracy without post-training. For example, with a 30\% compression rate on the LLaMA-2-70B model, SoLA surpasses the state-of-the-art method by reducing perplexity from 6.95 to 4.44 and enhancing downstream task accuracy by 10\%.
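
To make the decomposition scheme concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the split the abstract describes: the few FFN neurons whose outputs carry the most energy on calibration data are kept dense, while the remaining majority is replaced by a rank-truncated SVD factorization. The function name, the `prime_ratio` and `rank` parameters, and the per-neuron energy score are illustrative assumptions.

```python
# Minimal sketch of the idea described in the abstract (not the authors'
# implementation): keep the few FFN neurons whose outputs carry the most
# energy on calibration data ("prime" components) dense, and replace the
# remaining majority with a rank-truncated SVD factorization.
# `prime_ratio`, `rank`, and the energy score are illustrative assumptions.
import torch

def sola_style_compress(W: torch.Tensor, X: torch.Tensor,
                        prime_ratio: float = 0.05, rank: int = 64):
    """W: (d_in, d_ffn) FFN projection; X: (n_tokens, d_in) calibration."""
    scores = (X @ W).norm(dim=0)               # output energy per neuron
    n_prime = max(1, int(prime_ratio * W.shape[1]))
    prime_idx = scores.topk(n_prime).indices   # neurons kept exactly
    mask = torch.ones(W.shape[1], dtype=torch.bool)
    mask[prime_idx] = False                    # the rest get decomposed
    W_rest = W[:, mask]
    r = min(rank, *W_rest.shape)
    U, S, Vh = torch.linalg.svd(W_rest, full_matrices=False)
    A, B = U[:, :r] * S[:r], Vh[:r]            # W_rest ~= A @ B
    return W[:, prime_idx], prime_idx, A, B, mask
```

At inference time the layer output would be reassembled from `X @ W[:, prime_idx]` for the kept neurons and `(X @ A) @ B` for the compressed remainder; the paper's whitening (Theorem 1) and adaptive per-matrix rank allocation are omitted from this sketch.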

Paper Structure

This paper contains 26 sections, 1 theorem, 9 equations, 5 figures, and 7 tables.

Key Result

Theorem 1

Given an input $X$ and a weight matrix $W$ with singular value decomposition $W = U \Sigma V^{T}$, let $S$ be the Cholesky factor of $XX^{T}$ (so $XX^{T} = SS^{T}$). The compression loss of truncating the smallest $r-m$ singular values is $L^{2} = \Vert \sum_{i=m+1}^{r} \sigma_i u_i v_{i}^{T} S^{-1} X \Vert_{F}^{2} = \sum_{i=m+1}^{r} \sigma_i^{2}$.
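
For readers reconstructing the proof, the closing equality follows in one step from the whitening property of the Cholesky factor and the orthonormality of the singular vectors:

$$\Vert A S^{-1} X \Vert_{F}^{2} = \operatorname{tr}\!\big(A\, S^{-1} X X^{T} S^{-T} A^{T}\big) = \operatorname{tr}\!\big(A A^{T}\big) = \Vert A \Vert_{F}^{2}, \qquad A = \sum_{i=m+1}^{r} \sigma_i u_i v_{i}^{T},$$

since $XX^{T} = SS^{T}$ implies $S^{-1} X X^{T} S^{-T} = I$, and $\Vert A \Vert_{F}^{2} = \sum_{i=m+1}^{r} \sigma_i^{2}$ because the $u_i$ and $v_i$ are orthonormal.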

Figures (5)

  • Figure 1: Framework of the proposed SoLA. We initially recognize the soft activation sparsity within the feed-forward network. Leveraging this property, we introduce a fine-grained model decomposition technique to preserve model quality. Furthermore, to alleviate the compression error of SVD, we develop an adaptive component-wise truncation strategy to allocate appropriate truncation positions for different types of weight matrices.
  • Figure 2: Accumulation of $\Vert X W \Vert_F^2$ and distribution of $\Vert X W \Vert_F$ across neurons in different layers of LLaMA-2-7B and LLaMA-2-13B on the WikiText2 and C4 datasets, sorted from largest to smallest, highlighting the soft activation sparsity phenomenon. (A sketch of this measurement follows this list.)
  • Figure 3: Perplexity of WikiText2 among different methods on LLaMA-2-13B.
  • Figure 4: The impact of "Prime Neurons" ratios on LLaMA-2-13B perplexity under 20% and 30% compression ratios.
  • Figure 5: Perplexity of LLaMA-2-13B under a 30% compression ratio using calibration sets of different sizes (32, 64, 128, 256) and sources (WikiText2 and C4).
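
The diagnostic behind Figure 2 can be reproduced in a few lines. This is a hedged sketch based only on the caption (the per-neuron score and the calibration setup are assumptions): sorting per-neuron output norms and accumulating their squared values shows how quickly a small fraction of neurons accounts for most of $\Vert X W \Vert_F^2$.

```python
# Assumed reading of the Figure 2 diagnostic (not the authors' script):
# per-neuron output norms ||X w_j|| on calibration data, sorted from
# largest to smallest, plus each neuron's cumulative share of the total
# energy ||X W||_F^2 -- a steep curve indicates soft activation sparsity.
import torch

def soft_sparsity_curve(W: torch.Tensor, X: torch.Tensor):
    norms = (X @ W).norm(dim=0)                 # ||X w_j|| for neuron j
    sorted_norms, _ = norms.sort(descending=True)
    energy = sorted_norms.pow(2)
    cum_share = energy.cumsum(0) / energy.sum() # fraction of ||X W||_F^2
    return sorted_norms, cum_share
```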

Theorems & Definitions (1)

  • Theorem 1