SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression

Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, Mi Zhang

TL;DR

SVD-LLM V2 targets post-training compression of large language models by addressing two bottlenecks in prior SVD-based methods: heterogeneous redundancy across layers and unstable Cholesky-based truncation. It allocates a separate compression ratio to each weight matrix, guided by the theoretical minimum truncation loss $L_{min}$, and replaces the Cholesky step with a loss-optimized two-round SVD truncation that reaches the same minimum loss without the instability. Empirically, SVD-LLM V2 consistently surpasses state-of-the-art SVD-based baselines across ten datasets and five LLMs, reducing perplexity and improving accuracy, with up to $2.71\times$ speedups on real hardware. It also compares favorably with structured pruning and quantization, and it improves further when combined with quantization, which makes it practical for efficient post-training LLM deployment.

Abstract

Despite significant advancements, the practical deployment of Large Language Models (LLMs) is often hampered by their immense sizes, highlighting the need for effective compression techniques. Singular Value Decomposition (SVD) is a promising LLM compression technique. However, existing SVD-based compression methods fall short in reducing truncation losses, leading to less competitive performance in compressed models. In this work, we introduce SVD-LLM V2, an SVD-based LLM compression method that optimizes singular value truncation in SVD compression with two techniques. First, SVD-LLM V2 uses the theoretical truncation loss of weight matrices to assign a unique compression ratio to each weight matrix at different layers, accommodating weight redundancy heterogeneity. Second, SVD-LLM V2 applies loss-optimized weight truncation to ensure that the truncated singular values result in a lower and more stable truncation loss in practice. We evaluate SVD-LLM V2 on ten datasets and five LLMs at various scales. Our results show that SVD-LLM V2 outperforms state-of-the-art SVD-based LLM compression methods. Our code is available at https://github.com/AIoT-MLSys-Lab/SVD-LLM

Paper Structure

This paper contains 15 sections, 1 theorem, 3 equations, 6 figures, 8 tables, 2 algorithms.

Key Result

Theorem 3.1

If $U_s, S_s, V_s$ are obtained by the SVD of $XX^T$ and $U_{ws}, S_{ws}, V_{ws}$ are obtained by the SVD of $W \times U_s \times \sqrt{S_s}$, the compressed weight matrix $W' = U_{ws} \times \operatorname{Trunc.}(S_{ws}) \times V_{ws} \times \sqrt{S_s}^{-1} \times U_{\dots}$ ...
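
The two-round SVD construction named in Theorem 3.1 can be illustrated with a small NumPy sketch. This is not the authors' implementation. It rests on three assumptions that the listing here does not confirm: the truncation loss is the Frobenius norm $\|WX - W'X\|_F$, $XX^T$ has full rank, and the final factor of $W'$ is $U_s^T$ (the inverse of the orthogonal $U_s$). The function name `two_round_svd_truncate` and the toy shapes are hypothetical.

```python
import numpy as np


def two_round_svd_truncate(W, X, rank):
    """Sketch of a two-round SVD truncation (assumed form, see lead-in).

    W: (m, n) weight matrix, X: (n, N) calibration activations.
    Returns the rank-limited W' and the singular values of W @ U_s @ sqrt(S_s).
    """
    # Round 1: SVD of the symmetric positive semi-definite matrix X X^T.
    U_s, S_s, _ = np.linalg.svd(X @ X.T)
    sqrt_S = np.sqrt(S_s)  # assumes X X^T is full rank, so no zeros here

    # Round 2: SVD of W U_s sqrt(S_s); broadcasting scales column j by sqrt_S[j].
    M = (W @ U_s) * sqrt_S
    U_ws, S_ws, Vt_ws = np.linalg.svd(M, full_matrices=False)

    # Trunc.: keep only the `rank` largest singular values.
    M_k = (U_ws[:, :rank] * S_ws[:rank]) @ Vt_ws[:rank]

    # Map back: right-multiply by sqrt(S_s)^{-1}, then by U_s^T (assumption).
    W_compressed = (M_k / sqrt_S) @ U_s.T
    return W_compressed, S_ws


# Toy usage with random data (shapes are arbitrary).
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
X = rng.standard_normal((8, 32))
W_c, S_ws = two_round_svd_truncate(W, X, rank=3)
loss = np.linalg.norm(W @ X - W_c @ X)
print(loss, np.sqrt(np.sum(S_ws[3:] ** 2)))  # the two values should agree
```

Under these assumptions, the loss equals the root of the sum of squared discarded singular values of $W U_s \sqrt{S_s}$, and no Cholesky factorization of $XX^T$ is needed.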

Figures (6)

  • Figure 1: Comparison between SVD-LLM V2 and SVD-LLM. We randomly select a weight matrix from LLaMA-3 8B and compare the normalized truncation loss and perplexity (PPL) under a 20% compression ratio.
  • Figure 2: Overview of SVD-LLM V2.
  • Figure 3: Comparison between SVD-LLM and SVD-LLM V2 on the truncation loss of the query weight matrix across different layers in LLaMA-3 8B on the WikiText-2 dataset with a 50% compression ratio.
  • Figure 4: Perplexity on WikiText-2 and average accuracy on six classification datasets of LLaMA-7B compressed by SVD-LLM V2 and other SVD-based LLM compression baselines under 20% to 80% compression ratios. The perplexity values of FWSVD and ASVD are larger than 100 and are therefore not shown in the figure.
  • Figure 5: Throughput (tokens/s) achieved by the original LLaMA-7B and its versions compressed by SVD-LLM V2 under different compression ratios on a single NVIDIA A100 GPU. We fix the batch size to 4, the prefill length to 1024, and the decoding length to 256. The speedup over the original LLM is marked in red.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 3.1
  • proof