Compressing Large Language Models using Low Rank and Low Precision Decomposition

Rajarshi Saha; Naomi Sagan; Varun Srivastava; Andrea J. Goldsmith; Mert Pilanci

TL;DR

Results illustrate that compressing LlaMa-$2$ $7$B/$13$B/$70$B and LlaMa-$3$ $8$B models using $\rm CALDERA$ outperforms existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter.

Abstract

The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$ -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further improves the zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that compressing LlaMa-$2$ $7$B/$13$B/$70$B and LlaMa-$3$ $8$B models using $\rm CALDERA$ outperforms existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
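To make the shape of the $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition concrete, the following is a minimal NumPy sketch under loud assumptions. It is not the implementation from the repository above: the round-to-nearest `uniform_quantize` is a stand-in quantizer, the alternating updates minimize the plain Frobenius residual rather than the calibration-aware objective, and the names `uniform_quantize`, `data_aware_error` and `q_plus_lr` are hypothetical. Only the final error metric uses the calibration matrix $\mathbf{X}$.

```python
import numpy as np

def uniform_quantize(A, bits):
    """Round-to-nearest uniform quantizer over [A.min(), A.max()].
    A stand-in chosen for illustration, not the paper's quantizer."""
    levels = 2 ** bits - 1
    lo, hi = A.min(), A.max()
    if hi == lo:
        return A.copy()
    scale = (hi - lo) / levels
    return np.round((A - lo) / scale) * scale + lo

def data_aware_error(W, W_hat, X):
    """Relative error ||(W_hat - W) X^T||_F / ||W X^T||_F."""
    return np.linalg.norm((W_hat - W) @ X.T) / np.linalg.norm(W @ X.T)

def q_plus_lr(W, rank, bits_q=2, bits_lr=4, iters=10):
    """Alternate between quantizing the residual (Q) and refitting
    quantized low-rank factors (L, R) with a truncated SVD.
    Assumption: data-agnostic updates, unlike the calibration-aware method."""
    n, d = W.shape
    L = np.zeros((n, rank))
    R = np.zeros((rank, d))
    for _ in range(iters):
        # Low-precision backbone: quantize what the low-rank part misses.
        Q = uniform_quantize(W - L @ R, bits_q)
        # Low-rank correction: best rank-k fit of the remaining residual.
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(s[:rank])
        L = uniform_quantize(U[:, :rank] * root, bits_lr)
        R = uniform_quantize(root[:, None] * Vt[:rank], bits_lr)
    return Q, L, R

# Synthetic approximately-low-rank weight matrix and calibration data.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 128))
W += 0.1 * rng.standard_normal((64, 128))
X = rng.standard_normal((32, 128))

Q, L, R = q_plus_lr(W, rank=8)
err = data_aware_error(W, Q + L @ R, X)
print(f"relative data-aware error: {err:.3f}")
```

The sketch gives $\mathbf{Q}$ fewer bits than $\mathbf{L}$ and $\mathbf{R}$, mirroring the typical setting $\mathrm{B}_{\rm Q} < \mathrm{B}_{\rm L}, \mathrm{B}_{\rm R}$ described in the Figure 2 caption.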

Paper Structure (35 sections, 13 theorems, 80 equations, 5 figures, 8 tables, 2 algorithms)

Introduction
Significance and Related Works
Problem Formulation
Proposed Algorithm: Calibration-Aware Low-Precision Decomposition with Low Rank Adaptation
Approximation Error Analysis
Analysis Outline
Numerical Simulations
Zero-shot Results
Fine-tuning of Randomized Hadamard Transform (RHT) Parameters
Low Rank Adaptation (LoRA) Fine-tuning Results
Autoregressive Generation Throughput
Conclusions
Notations
Rank-constrained Regression
Derivations for Calibration-Aware Low-Precision and Low-Rank Decomposition: caldera
...and 20 more sections

Key Result

Theorem 4.1

Approximation error of caldera (Informal). Given $\mathbf{W} \in \mathbb{R}^{n \times d}$ and $\mathbf{X} \in \mathbb{R}^{m \times d}$ with $m \leq d$, let $\mathbf{D}$ be obtained from the LDL decomposition $\mathbf{X}^\top\mathbf{X} = m\mathbf{H} = (\mathbf{M} + \mathbf{I})\mathbf{D}(\mathbf{M} + \ldots$ … while utilizing an average budget of $\frac{1}{2}\log_2\left(\frac{k\sigma_1^3}{\ldots}\right)$ …

Figures (5)

Figure 1: Decaying spectrum of weight matrices (aka, "approximate low-rank")
Figure 2: caldera decomposes a full-precision weight matrix into a low-rank component ($\mathbf{L}\mathbf{R}$), which captures the contribution of the top singular values using $\mathrm{B}_{\rm L}, \mathrm{B}_{\rm R}$ bits, and $\mathbf{Q}$ for the trailing singular values with $\mathrm{B}_{\rm Q}$ bits, enabling flexible precision settings for each component. Typically, $\mathrm{B}_{\rm Q} < \mathrm{B}_{\rm L}, \mathrm{B}_{\rm R}$.
Figure 3: Throughputs for meta-llama/Llama-2-{7,70}b-hf on an NVIDIA A10G GPU for a batch size and sequence length of $1$ ($\mathrm{B}_{\rm Q} = 2$ for all rows)
Figure 4: Relative data-aware Frobenius norm error per iteration of caldera for selected matrices of LLaMa-2 7B layer 25. For all experiments, the bit precision of $\mathbf{Q}$ is $2$, and the calibration dataset is the same as used in the Numerical Simulations section. The first iteration of caldera with the Hessian update is omitted, as it has a large error, inhibiting plot readability.
Figure 5: Relative data-aware Frobenius norm error per iteration of LPLRFactorize, for the decomposition $\mathbf{W} \approx \mathbf{L} \mathbf{R}$, for two matrices in LLaMa-2 7B layer 25.
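
The bit budgets named in the Figure 2 caption combine into an average number of bits per original parameter. The short sketch below does that arithmetic for $\mathbf{W} \in \mathbb{R}^{n \times d}$ with rank-$k$ factors. It assumes a plain storage count ($\mathbf{Q}$ is $n \times d$, $\mathbf{L}$ is $n \times k$, $\mathbf{R}$ is $k \times d$) and ignores any overhead such as scales or codebooks; the function name `average_bits_per_parameter` and the example sizes are hypothetical.

```python
def average_bits_per_parameter(n, d, k, b_q, b_l, b_r):
    """Total bits stored for Q, L and R, divided by the n*d entries of W.
    Assumption: no overhead for scales, codebooks or metadata."""
    total_bits = b_q * n * d + b_l * n * k + b_r * k * d
    return total_bits / (n * d)

# Illustrative sizes: a 4096 x 4096 matrix, rank 64, 2-bit Q, 4-bit L and R.
avg_bits = average_bits_per_parameter(4096, 4096, 64, b_q=2, b_l=4, b_r=4)
print(avg_bits)  # 2.125
```

Under this count, the low-rank factors add only $k(n\mathrm{B}_{\rm L} + d\mathrm{B}_{\rm R})/(nd)$ bits per parameter on top of $\mathrm{B}_{\rm Q}$, which is why a 2-bit $\mathbf{Q}$ with a modest rank can stay below $2.5$ bits per parameter.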

Theorems & Definitions (20)

Theorem 4.1
Lemma 4.2
Lemma B.1
proof
Lemma C.1
proof
Lemma C.2
proof
Lemma C.3
proof
...and 10 more
