Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang

TL;DR

Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches and parameter sharing techniques, especially under large compression ratios.

Abstract

Large Language Models (LLMs) have achieved remarkable breakthroughs. However, the huge number of parameters in LLMs requires a significant amount of memory during inference, which prevents their practical deployment in many applications. To reduce the memory footprint of LLMs, singular value decomposition (SVD) offers a promising way to approximate weight matrices for compression. In this paper, we take a step further and explore parameter sharing across different layers with SVD to compress LLMs more effectively. Specifically, weight matrices in different layers are decomposed and represented as linear combinations of a set of shared basis vectors with layer-specific coefficients. We examine which types of weight matrices and which layers to select for basis sharing so that performance is maintained after compression. Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches and parameter sharing techniques, especially under large compression ratios. Code is available at: https://github.com/TUDa-HWAI/Basis_Sharing
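
The decomposition described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `share_basis`, the toy shapes, and the random weights are assumptions made for illustration, and the sketch applies plain SVD to the raw weights without any scaling step. It concatenates same-shaped layer matrices horizontally, keeps the top `k` left singular vectors as the shared basis, and splits the remaining factor into one coefficient block per layer.

```python
import numpy as np

def share_basis(weights, k):
    """Factor same-shaped layer weights into one shared basis plus per-layer coefficients.

    weights: list of n arrays, each of shape (d1, d2)  (hypothetical toy input)
    k: number of shared basis vectors to keep
    Returns (basis, coeffs): basis has shape (d1, k), each coeffs[i] has shape (k, d2).
    """
    d2 = weights[0].shape[1]
    # Concatenate the n layer matrices horizontally: shape (d1, n * d2).
    stacked = np.concatenate(weights, axis=1)
    # Thin SVD; NumPy returns singular values in descending order.
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = u[:, :k]                      # shared by all layers
    coeff_all = s[:k, None] * vt[:k, :]   # shape (k, n * d2)
    # Split the coefficient matrix back into one block per layer.
    coeffs = [coeff_all[:, i * d2:(i + 1) * d2] for i in range(len(weights))]
    return basis, coeffs

# Toy example: two random "layers" (real LLM weights are far larger).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 48)) for _ in range(2)]
basis, coeffs = share_basis(layers, k=16)

# Each layer is approximated as basis @ coeffs[i].
approx = [basis @ c for c in coeffs]
rel_err = [np.linalg.norm(w - a) / np.linalg.norm(w) for w, a in zip(layers, approx)]

# Parameter count: one basis plus n coefficient blocks versus n dense matrices.
shared_params = basis.size + sum(c.size for c in coeffs)
dense_params = sum(w.size for w in layers)
```

In this toy setting the basis is stored once, so the shared form costs `d1*k + n*k*d2` parameters rather than `n*d1*d2`. Random matrices have no low-rank structure, so the reconstruction error here says nothing about the accuracy reported in the paper.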

Paper Structure (29 sections, 4 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: (a) Two layers share the same weight matrix in previous work. (b) Two layers share the same basis matrix but have their individual coefficients in our work.
  • Figure 2: Weight matrices across $n$ layers are concatenated horizontally into a single matrix, which is processed by SVD. The $j$-th column of the original weight matrix in a layer can be represented as a linear combination of $k$ shared basis vectors and coefficients.
  • Figure 3: PPL ($\downarrow$) of three different LLMs (OPT-6.7B, LLaMA 2-7B, and Mistral-7B) under a 20% compression ratio on WikiText-2.
  • Figure 4: Frobenius loss incurred by basis sharing across any two layers. The number and color of each block give the Frobenius loss that results when the two corresponding layers share a basis matrix; the entries on the diagonal are obtained by applying SVD directly to the scaled weight matrix of a single layer. (a) Frobenius loss incurred by basis sharing across two layers for ${\bm{W}}_K$ in LLaMA2-7B. (b) Frobenius loss incurred by basis sharing across two layers for ${\bm{W}}_O$ in LLaMA2-7B.
  • Figure 5: LoRA fine-tuning results of LLaMA-7B under 20% compression ratio with different compression methods.
  • ...and 4 more figures
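
The pairwise comparison described in the Figure 4 caption can be sketched roughly as follows. This is an illustrative assumption, not the paper's procedure: the helper names `truncation_loss` and `sharing_loss_grid` are hypothetical, the same rank `k` is used for the diagonal and off-diagonal entries, the weights are random, and the scaling of the weight matrices mentioned in the caption is omitted.

```python
import numpy as np

def truncation_loss(matrix, k):
    """Frobenius norm of the residual left by the best rank-k approximation."""
    s = np.linalg.svd(matrix, compute_uv=False)
    # By the Eckart-Young theorem the residual norm is the root of the
    # sum of squared discarded singular values.
    return float(np.sqrt(np.sum(s[k:] ** 2)))

def sharing_loss_grid(weights, k):
    """Entry (i, j), i != j: loss when layers i and j share one k-vector basis.
    Entry (i, i): loss of a rank-k SVD applied to layer i alone."""
    n = len(weights)
    grid = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                grid[i, j] = truncation_loss(weights[i], k)
            else:
                # Sharing a basis corresponds to truncating the SVD of the
                # horizontally concatenated pair.
                pair = np.concatenate([weights[i], weights[j]], axis=1)
                grid[i, j] = truncation_loss(pair, k)
    return grid

# Toy example with three random "layers" (hypothetical shapes).
rng = np.random.default_rng(1)
layers = [rng.standard_normal((32, 24)) for _ in range(3)]
grid = sharing_loss_grid(layers, k=8)
```

Under these assumptions the grid is symmetric, and with equal `k` a shared basis can never beat the two per-layer truncations combined, so the off-diagonal entries indicate how much extra loss a given pairing costs.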