LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, Tuo Zhao

TL;DR

LoSparse introduces a two-component weight approximation for Transformers, approximating each weight matrix as $W \approx UV + S$ to decouple the coherent (expressive) and incoherent (non-expressive) parts of neurons. The low-rank term $UV$ captures shared, coherent structure, while the sparse term $S$ targets diverse, task-specific information. This enables structured pruning of neurons, with the factors initialized from an SVD and the sparse term pruned iteratively on a schedule. Empirical results on GLUE, SQuADv1.1, and XSum show that LoSparse outperforms traditional pruning and pure low-rank methods, and it remains compatible with knowledge distillation and other compression strategies. The approach offers a route to compressing large language models at high sparsity with minimal performance loss, which makes them more practical to deploy.

Abstract

Transformer models have achieved remarkable results in various natural language tasks, but they are often prohibitively large, requiring massive memories and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of both low-rank approximations and pruning, while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts in neurons, while pruning removes the incoherent and non-expressive parts in neurons. Pruning enhances the diversity of low-rank approximations, and low-rank approximation prevents pruning from losing too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks. We show that it significantly outperforms existing compression methods.
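The decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `losparse_init`, the `keep_ratio` parameter, the choice to treat columns of the residual as the prunable units, and the use of column norms as the importance score are all assumptions made for illustration. It also prunes in a single shot, whereas the summary describes an iterative, scheduled process.

```python
import numpy as np

def losparse_init(W, rank, keep_ratio):
    """Split W into a rank-`rank` product U @ V plus a column-sparse residual S.

    Hypothetical helper: column-norm scoring and one-shot pruning are
    stand-ins, not the scoring or schedule used in the paper.
    """
    # Truncated SVD gives the best rank-`rank` approximation of W.
    P, sigma, Qt = np.linalg.svd(W, full_matrices=False)
    U = P[:, :rank] * sigma[:rank]   # (d_out, rank), singular values folded into U
    V = Qt[:rank, :]                 # (rank, d_in)
    S = W - U @ V                    # residual, dense before pruning

    # Assumed importance score: L2 norm of each residual column.
    scores = np.linalg.norm(S, axis=0)
    n_keep = int(round(keep_ratio * S.shape[1]))
    mask = np.zeros(S.shape[1], dtype=bool)
    if n_keep > 0:
        mask[np.argsort(scores)[-n_keep:]] = True

    # Structured pruning: zero out whole columns rather than single entries.
    S_pruned = S * mask
    return U, V, S_pruned, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
U, V, S, mask = losparse_init(W, rank=2, keep_ratio=0.5)

err_low_rank = np.linalg.norm(W - U @ V)
err_losparse = np.linalg.norm(W - (U @ V + S))
print(f"low-rank only: {err_low_rank:.3f}, low-rank + sparse: {err_losparse:.3f}")
```

Keeping any columns of the residual can only lower the reconstruction error relative to the low-rank term alone, which is the intuition behind combining the two components.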

Paper Structure (31 sections, 13 equations, 6 figures, 13 tables, 1 algorithm)


Figures (6)

  • Figure 1: Histogram of neuron importance scores. (a) The practical neuron importance scores of a linear layer when pruning BART-large on XSum. (b) The ideal histogram of neuron importance scores, in which most of the neurons are redundant; otherwise pruning is not the best choice.
  • Figure 2: Illustration of one linear projection in a transformer neural network. We use $UV + S$, a low-rank approximation plus a sparse matrix, to approximate the weight matrix $W$. $UV$ and $S$ represent the coherent and incoherent parts of neurons in $W$, respectively. The forward passes of the two terms are computed in parallel.
  • Figure 3: Singular values in language models. (a) Singular values of weight matrices of the 10th decoder layer in BART-large; (b) Singular values of weight matrices of the 14th encoder layer in DeBERTaV3-large.
  • Figure 4: Neuron importance scores of selected linear projections when compressing DeBERTaV3-base on SST-2 with ITP (blue) and LoSparse (orange). It shows that LoSparse successfully separates the incoherent parts of neurons and makes it easy to prune the non-expressive components.
  • Figure 5: Comparison between LoSparse and two variants of low-rank approximation on different tasks. The $x$-axis represents the remaining ratio. LoSparse outperforms all other low-rank approximation variants, which indicates that adding a sparse approximation can improve the performance of low-rank approximation.
  • ...and 1 more figure
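
The two-branch forward pass described in the Figure 2 caption can be sketched as follows. The function names, the row-vector convention `y = x @ W.T`, and the column-sparse layout of `S` are assumptions made for illustration; the code evaluates the two branches one after the other, but they share no intermediate results and could run in parallel.

```python
import numpy as np

def forward_two_branch(x, U, V, S):
    """Compute x @ (U @ V + S).T without materializing the dense matrix."""
    low_rank_out = (x @ V.T) @ U.T   # two thin matmuls through the rank bottleneck
    sparse_out = x @ S.T             # only nonzero columns of S contribute
    return low_rank_out + sparse_out

def stored_parameters(d_out, d_in, rank, kept_columns):
    """Parameters stored for U, V, and the kept columns of S (assumed layout)."""
    return rank * (d_out + d_in) + kept_columns * d_out

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4
U = rng.standard_normal((d_out, rank))
V = rng.standard_normal((rank, d_in))
S = np.zeros((d_out, d_in))
S[:, :8] = rng.standard_normal((d_out, 8))   # 8 kept columns, the rest pruned

x = rng.standard_normal((5, d_in))
y_branches = forward_two_branch(x, U, V, S)
y_dense = x @ (U @ V + S).T

print(np.allclose(y_branches, y_dense))
print(stored_parameters(d_out, d_in, rank, kept_columns=8), "vs", d_out * d_in)
```

Storage shrinks only when the rank and the number of kept columns are small relative to the layer dimensions, so the savings depend on the chosen remaining ratio.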