Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives

Qinsi Wang, Jinghan Ke, Masayoshi Tomizuka, Yiran Chen, Kurt Keutzer, Chenfeng Xu

TL;DR

Dobi-SVD introduces a principled, differentiable SVD-based compression framework for LLMs that prioritizes activation truncation over weight truncation, paired with an IPCA-based weight update and a novel remapped storage scheme. By smoothing truncation to a learnable position, applying Incremental PCA to derive the optimal rank-k weight, and remapping storage to achieve a bijective compression ratio, the method overcomes longstanding SVD limitations and achieves strong task performance with minimal degradation at high compression. Empirical results on LLaMA-family models show state-of-the-art SVD-based compression, substantial hardware speedups, and compatibility with quantization, while extension to vision-language and vision-language-action models demonstrates generality. The work has practical implications for deploying large models on resource-constrained hardware, edge devices, and robotics, where memory and compute efficiency are critical.
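One way to picture "smoothing truncation to a learnable position" is to replace the hard keep/drop mask over singular values with a smooth gate whose position k can take fractional values. The sketch below is illustrative only: the tanh gate, the sharpness parameter `beta`, and the function names are assumptions made for this example, not details confirmed by this summary.

```python
import math

def soft_keep_weights(num_values, k, beta=10.0):
    """Smooth surrogate for the hard mask [i < k] over indices 0..num_values-1.

    Assumption: a tanh gate centred between index k-1 and index k.
    `beta` (hypothetical) controls how sharp the transition is.
    """
    return [0.5 * math.tanh(beta * (k - i - 0.5)) + 0.5 for i in range(num_values)]

def hard_keep_weights(num_values, k):
    """Conventional truncation: keep the first k singular values, drop the rest."""
    return [1.0 if i < k else 0.0 for i in range(num_values)]

def soft_truncate(singular_values, k, beta=10.0):
    """Scale each singular value by its smooth keep weight."""
    weights = soft_keep_weights(len(singular_values), k, beta)
    return [s * w for s, w in zip(singular_values, weights)]
```

With a large `beta` the gate approaches the hard mask, while a moderate `beta` makes the output change continuously as k moves, which is what lets a truncation position be adjusted by gradient-based training.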

Abstract

We provide a new LLM-compression solution via SVD, unlocking new possibilities for LLM compression beyond quantization and pruning. We point out that the optimal use of SVD lies in truncating activations, rather than merely using activations as an optimization distance. Building on this principle, we address three critical challenges in SVD-based LLM compression: (1) How can we determine the optimal activation truncation position for each weight matrix in LLMs? (2) How can we efficiently reconstruct the weight matrices based on truncated activations? (3) How can we address the inherent "injection" nature of SVD that results in information loss? We propose Dobi-SVD, which establishes a new, principled approach to SVD-based LLM compression.
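The claim that truncating activations is preferable to truncating weights can be checked numerically on a toy layer. The sketch below assumes a single linear map with activation A = XW, random stand-in data, and hypothetical helper names; it is not the authors' pipeline. By the Eckart-Young theorem, the rank-k truncation of A is the closest rank-k matrix to A in Frobenius norm, so its error can never exceed that of X times a rank-k truncation of W.

```python
import numpy as np

def truncate_rank(matrix, k):
    """Best rank-k approximation of `matrix` via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def compare_truncation_errors(X, W, k):
    """Frobenius error on the activation A = X @ W for two truncation choices.

    Returns (error when truncating A directly,
             error when truncating W first and then multiplying by X).
    """
    A = X @ W
    err_activation = np.linalg.norm(A - truncate_rank(A, k))
    err_weight = np.linalg.norm(A - X @ truncate_rank(W, k))
    return err_activation, err_weight

# Toy data: random stand-ins for calibration inputs and a weight matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 32))
W = rng.standard_normal((32, 48))
err_act, err_w = compare_truncation_errors(X, W, k=8)
print(f"activation truncation error: {err_act:.3f}")
print(f"weight truncation error:     {err_w:.3f}")
```

The sketch only compares errors; turning a truncated activation back into a stored weight matrix is the second challenge listed above and is not addressed here.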

Paper Structure

This paper contains 40 sections, 10 equations, 13 figures, 27 tables, 5 algorithms.

Figures (13)

  • Figure 1: Overview framework of Dobi-SVD: 1-3: Differentiable Truncation Position Training. By applying parameter renormalization for continuous rank ratio selection and using Taylor expansion to prevent gradient explosion, our method enables robust and adaptive optimization of truncation positions. 4: Weight Update. Using IPCA, we sequentially extract and optimally update weight matrix features. 5: Remapping. We resolve a long-overlooked limitation of traditional SVD-based compression through remapping, fully unlocking SVD’s potential for data compression.
  • Figure 2: The differences between Dobi-SVD's method and previous approaches in handling activations and obtaining new weights. See the appendix section labeled sec_appn_novelPath for a detailed explanation of this figure.
  • Figure 3: (Left) Performance comparison of different training methods on LLaMA-7B. For activation truncation (multi-layer) we truncate only layers 29-31, and for activation truncation (single-layer) we truncate only the 29th layer. (Middle) Comparison of model performance when training with batch sizes of 256 and 16. (Right) Comparison of the memory requirements of PCA and IPCA for an $n \times n$ matrix.
  • Figure 4: Tokens/sec of the original LLaMA-7B and its Dobi-SVD-compressed versions at 40%, 60% and 80% compression ratios on a single A100 GPU. (a): comparison across batch sizes with sequence length = 32. (b): comparison across sequence lengths with batch size = 64.
  • Figure 5: Data distribution of the attention Q matrix in layer 20 of LLaMA-7B.
  • ...and 8 more figures
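The memory comparison between PCA and IPCA in Figure 3 rests on a general property of incremental methods: the principal directions are updated batch by batch, so the full data matrix never has to be held in memory at once. The sketch below is a generic incremental SVD update written for illustration, not the authors' procedure; the function name is hypothetical, and it skips the mean-centering that a full PCA would apply.

```python
import numpy as np

def incremental_topk(batches, k):
    """Track the top-k singular values and right singular vectors over row batches.

    Each step stacks the current summary (k rows) on top of the new batch and
    re-factorises, so peak memory scales with (k + batch_rows) x n_features
    rather than with the total number of rows.
    Assumption: no mean-centering, unlike a full PCA.
    """
    s, vt = None, None
    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        if s is None:
            stacked = batch
        else:
            # s[:, None] * vt has the same Gram matrix as the data kept so far.
            stacked = np.vstack([s[:, None] * vt, batch])
        _, s_new, vt_new = np.linalg.svd(stacked, full_matrices=False)
        s, vt = s_new[:k], vt_new[:k]
    return s, vt

# Toy usage: 120 rows of exactly rank-4 data, processed in 6 batches.
rng = np.random.default_rng(0)
data = rng.standard_normal((120, 4)) @ rng.standard_normal((4, 20))
s_inc, vt_inc = incremental_topk(np.array_split(data, 6), k=4)
```

When the data has rank at most k, as in the toy usage, each truncation step is lossless and the incremental singular values match those of a one-shot SVD; for higher-rank data the result is an approximation.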