AA-SVD: Anchored and Adaptive SVD for Large Language Model Compression

Atul Kumar Sinha, François Fleuret

Abstract

We introduce a fast low-rank factorization-based framework for compressing large language models that enables rapid compression of billion-parameter models without retraining. Unlike existing factorization-based approaches that optimize only on the original inputs, ignoring distribution shifts from upstream compression and thus propagating errors forward, or those that rely only on shifted inputs and risk drifting away from the original outputs, our approach accounts for both. Beyond individual layer compression, we further refine each transformer block end-to-end, minimizing block-level output distortion and allowing compressed layers to jointly compensate for accumulated errors. By anchoring each compressed layer to the original outputs while explicitly modeling input distribution shifts, our method finds a low-rank approximation that maintains functional equivalence with the original model. Experiments on large language models show that our method consistently outperforms existing SVD-based baselines across compression ratios, with the advantage becoming increasingly pronounced at aggressive compression budgets, where competing methods degrade substantially or collapse entirely, offering a practical solution for efficient, large-scale model deployment.

Paper Structure

This paper contains 29 sections, 3 theorems, 16 equations, 4 figures, 10 tables, 2 algorithms.

Key Result

Lemma 3.1

Let ${\bm{W}} \in \mathbb{R}^{m \times n}$ with thin SVD ${\bm{W}} = {\bm{U}}{\bm{\Sigma}}{\bm{V}}^\top$ and singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)}$. Then, for any target rank $k$, $\min_{\operatorname{rank}({\bm{W}}') \le k} \|{\bm{W}} - {\bm{W}}'\|_F^2 = \sum_{i > k} \sigma_i^2$, and the unique minimizer is ${\bm{W}}'^\star = \operatorname{SVD}_k({\bm{W}}) = {\bm{U}}_k{\bm{\Sigma}}_k{\bm{V}}_k^\top$, the truncation to the top-$k$ singular components. $\blacktriangleleft$
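
As a concrete illustration of Lemma 3.1, the sketch below computes $\operatorname{SVD}_k({\bm{W}})$ with NumPy; the function name `svd_k` and the NumPy dependency are illustrative choices, not part of the paper.

```python
import numpy as np

def svd_k(W: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of W in Frobenius norm (Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # thin SVD: W = U @ diag(s) @ Vt
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # keep the top-k singular components

# Storing the factors U_k @ diag(s_k) (m x k) and V_k^T (k x n) costs (m + n) * k
# numbers instead of m * n for the dense W, which is where the compression comes
# from when k << min(m, n).
```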

Figures (4)

  • Figure 1: Distortion (cosine distance) between intermediate features of the original and compressed model. Diagonal lines link each method's final-layer distortion to its WikiText2 perplexity. AA-SVD suppresses compression error consistently across depth.
  • Figure 2: Overview of the two-stage compression pipeline. Left: Four layer-wise compression objectives, differing in which inputs and outputs are compared. Input-agnostic: $\|{\bm{W}} - {\bm{W}}'\|_F^2$ — ignores activations entirely. Input-aware: $\|{\bm{W}}{\bm{X}} - {\bm{W}}'{\bm{X}}\|_F^2$ — matches outputs on original inputs ${\bm{X}}$. Shift-aware: $\|{\bm{W}}{\bm{X}}' - {\bm{W}}'{\bm{X}}'\|_F^2$ — matches outputs on the shifted inputs ${\bm{X}}'$ seen after upstream compression. Anchored adaptive (ours): $\|{\bm{W}}{\bm{X}} - {\bm{W}}'{\bm{X}}'\|_F^2$ — anchors the target to the original output while conditioning on the shifted input, combining an uncorrupted reference with distribution-shift awareness (see the code sketch after this figure list). Right: Block-level local refinement. Stage 1 factorizes all linear layers in the block independently via any layer-wise objective. Stage 2 then jointly optimizes all factorized weights to minimize the block-output error $\ell=\|\mathcal{L}({\bm{X}}) - \mathcal{L}'({\bm{X}}')\|_F^2$, keeping upstream blocks frozen — the same anchored adaptive spirit as the layer-wise objective, but applied at block granularity. This lets the compressed layers within a block compensate for each other's residual errors, substantially recovering block-output fidelity.
  • Figure 3: Impact of calibration set size on compression performance. Performance is measured by perplexity on WikiText2 (left) and C4 (middle), and average accuracy across seven zero-shot reasoning tasks (right).
  • Figure 4: Layer-wise error evolution across LLaMA-7B at ratio $0.8$, evaluated on WikiText2 test split samples. Top row: MSE between original and compressed outputs. Bottom row: cosine distance between original and compressed outputs. Results are shown separately for attention output projections (O-proj), MLP-down projections, and full block outputs.
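
To make the layer-wise objectives of Figure 2 concrete, the following is a minimal sketch that solves the anchored adaptive problem $\min_{\operatorname{rank}({\bm{W}}') \le k} \|{\bm{W}}{\bm{X}} - {\bm{W}}'{\bm{X}}'\|_F^2$ via standard reduced-rank regression. This is an assumption-laden illustration rather than the paper's implementation: the function name, the ridge term `eps`, and the use of NumPy are our choices, and the paper's own closed-form solution may differ, e.g. in how it handles a rank-deficient or ill-conditioned Gram matrix.

```python
import numpy as np

def anchored_adaptive_lowrank(W, X, X_shift, k, eps=1e-6):
    """Rank-k W' minimizing ||W @ X - W' @ X_shift||_F^2 (reduced-rank regression sketch).

    W        : (m, n) original weight
    X        : (n, N) original calibration inputs (anchor)
    X_shift  : (n, N) inputs observed after upstream compression
    k        : target rank
    eps      : small ridge term for numerical stability (illustrative choice)
    """
    Y = W @ X                                          # anchored target: original outputs
    G = X_shift @ X_shift.T + eps * np.eye(W.shape[1])  # regularized Gram matrix of shifted inputs
    W_ols = Y @ X_shift.T @ np.linalg.inv(G)           # unconstrained least-squares solution
    U, _, _ = np.linalg.svd(W_ols @ X_shift, full_matrices=False)
    Uk = U[:, :k]                                      # top-k left singular directions of the fit
    A, B = Uk, Uk.T @ W_ols                            # factored form: W' = A @ B has rank <= k
    return A, B

# The other objectives in Figure 2 follow the same recipe: pass X_shift = X for the
# input-aware objective, or replace Y with W @ X_shift for the shift-aware one.
# Stage 2 (block-level refinement) would then adjust all such factors in a block
# jointly to reduce the block-output error, with upstream blocks kept frozen.
```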

Theorems & Definitions (5)

  • Lemma 3.1: Eckart--Young--Mirsky
  • Theorem 3.2
  • Proof
  • Corollary 3.3: No distribution shift
  • Remark: Rank-deficient ${\bm{B}}$