Enhancing Delta Compression in LLMs via SVD-based Quantization Error Minimization

Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen

TL;DR

PrinMix is a principled SVD-based delta compression framework that minimizes quantization error under a bit budget by modeling quantization as a 0/1 ILP and introducing Reconstruction Target Correction (RTC) to mitigate sequential quantization bias. The method shows theoretically that the V factor of the SVD requires mixed-precision quantization and allocates bits to singular vectors via the ILP, while applying RTC to refine the quantization of U. Empirical results across reasoning, math, code, and multimodal tasks show PrinMix outperforming state-of-the-art delta-compression baselines (notably Delta-CoMe) on 7B and 13–14B models and enabling substantial memory and deployment efficiency in multi-tenant settings. The results hold across varied calibration data and compression ratios, with RTC contributing notable improvements on challenging tasks.

Abstract

Supervised Fine-Tuning (SFT) empowers Large Language Models (LLMs) with exceptional performance on specialized tasks, but it yields dense, high-dimensional delta parameters that pose severe storage and distribution challenges. Singular Value Decomposition (SVD)-based compression offers a compact representation for such delta parameters, but existing methods adopt heuristic quantization without clarifying the underlying mechanisms, leading to poor generalizability. In this work, we propose PrinMix, a rigorous SVD-based framework that models quantization as an optimization problem, grounding the design in mathematical mechanisms. We first derive the quantization error theoretically and identify a key singular-value-dominated scaling mechanism, which proves mathematically that mixed-precision quantization is necessary. We then model the quantization scheme as a 0/1 Integer Linear Programming (ILP) problem, which yields optimal bit-budget-constrained solutions without empirical assumptions. Furthermore, PrinMix integrates a Reconstruction Target Correction (RTC) method to compensate for errors from the $\mathbf{V}$-then-$\mathbf{U}$ sequential quantization process. In extensive experiments on 7B LLMs, PrinMix outperforms the state-of-the-art Delta-CoMe on challenging benchmarks by 22.3% on AIME2024 and 6.1% on GQA.
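The singular-value-dominated scaling mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the paper's derivation: it uses the plain Frobenius norm without calibration data, a random matrix as a stand-in for the delta weights, and a hypothetical min-max rounding quantizer (`quantize_uniform`). Under those assumptions, the error from quantizing the right singular factor splits exactly into a per-row product of a squared singular value and a squared rounding difference.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Hypothetical stand-in quantizer: min-max uniform rounding to 2**bits levels.

    bits == 0 drops the row entirely (all zeros).
    """
    if bits == 0:
        return np.zeros_like(x)
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((x - lo) / step) * step

rng = np.random.default_rng(0)
# Assumption: a random matrix stands in for delta = W_finetuned - W_base.
delta = rng.standard_normal((16, 12)) * 0.01

# Reduced SVD: delta == (U * S) @ Vt, with orthonormal columns in U.
U, S, Vt = np.linalg.svd(delta, full_matrices=False)

# Illustrative mixed-precision scheme: more bits for rows with larger singular values.
bits_per_row = [8, 8, 4, 4, 3, 3, 2, 2, 0, 0, 0, 0]
Vt_q = np.stack([quantize_uniform(Vt[i], b) for i, b in enumerate(bits_per_row)])

# Error measured directly on the reconstructed delta.
direct = np.linalg.norm(delta - (U * S) @ Vt_q, "fro") ** 2

# Same error as a sum over rows of (sigma_i ** 2) * ||v_i - q(v_i)|| ** 2.
scaling = S ** 2
difference = np.sum((Vt - Vt_q) ** 2, axis=1)
decomposed = float(scaling @ difference)

print(direct, decomposed)
```

Because the columns of `U` are orthonormal, the two printed values agree up to floating-point error. Rows paired with large singular values therefore amplify any rounding error, which is the intuition behind giving them more bits.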

Paper Structure

This paper contains 41 sections, 12 equations, 4 figures, 16 tables, 1 algorithm.

Figures (4)

  • Figure 1: An overview of PrinMix compared to single-precision quantization. Given the quantization scheme (➀), we compute the "difference" term (➂) by looking up the corresponding values in subfigure ➁. The quantization error of the $i$-th row of $\mathbf{V}$ comprises two components: a "scaling" term (➃) and a "difference" term (➂). PrinMix identifies the optimal quantization scheme within the constraints of the bit budget (➀) to effectively balance these two components, thereby minimizing the total quantization error of $\mathbf{V}$ (➄). Note that the "difference" term for various bit-widths (➁) is pre-computed using a calibration dataset and remains fixed during the optimization process.
  • Figure 2: (Left) The value of the "scaling" term (Eq. \ref{eq: v_loss}) at different row indices. (Right) The value of the "difference" term (Eq. \ref{eq: v_loss}) with different quantization bit-widths at different row indices. We compute all results using Q_Proj at the last layer of Qwen2.5-Math-7B-Instruct.
  • Figure 3: End-to-end decoding latency evaluation with varying numbers of deployed models using Qwen2.5-7B variants. (Left) Decoding memory usage. (Middle) Prefill time. (Right) Generation speed.
  • Figure 4: GPU memory usage with quantization bits across layers of Qwen2.5-Math-7B-Instruct.
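The bit allocation described in the Figure 1 caption (one bit-width per row of $\mathbf{V}$, a fixed table of pre-computed "difference" values, a total bit budget) has the shape of a multiple-choice knapsack problem. The sketch below is a minimal illustration under stated assumptions rather than the paper's ILP formulation or solver. The function `allocate_bits`, the toy numbers, and the cost model "scaling times difference" are all hypothetical, and a small dynamic program replaces a general ILP solver so that the example needs no dependencies.

```python
from typing import Dict, List, Tuple

def allocate_bits(costs: List[Dict[int, float]], budget: int) -> Tuple[float, List[int]]:
    """Choose exactly one bit-width per row so that total bits <= budget
    and the summed cost is minimal.

    costs[i][b] is the (hypothetical) error of row i when stored with b bits.
    Solved exactly by dynamic programming over the number of bits used.
    Returns (inf, []) if no assignment fits the budget.
    """
    inf = float("inf")
    # best[j] = (minimal cost using exactly j bits so far, chosen widths)
    best = [(inf, [])] * (budget + 1)
    best[0] = (0.0, [])
    for row in costs:
        nxt = [(inf, [])] * (budget + 1)
        for used, (cost, picks) in enumerate(best):
            if cost == inf:
                continue
            for bits, err in row.items():
                j = used + bits
                if j <= budget and cost + err < nxt[j][0]:
                    nxt[j] = (cost + err, picks + [bits])
        best = nxt
    return min(best, key=lambda entry: entry[0])

# Toy inputs (assumptions, not values from the paper).
scaling = [9.0, 4.0, 1.0, 0.25]                            # e.g. squared singular values
difference = {0: 1.0, 2: 0.3, 3: 0.12, 4: 0.05, 8: 0.001}  # per-bit-width lookup table

costs = [{b: s * d for b, d in difference.items()} for s in scaling]

total_error, scheme = allocate_bits(costs, budget=14)
uniform_error = sum(s * difference[3] for s in scaling)    # single-precision, 3 bits each

print(scheme, total_error, uniform_error)
```

On these toy numbers the exact solution assigns decreasing bit-widths (8, 4, 2, 0) to rows with decreasing scaling values, and its total error is below that of the uniform 3-bit scheme. The 0 bit-width models dropping a row and is a choice made for this example only.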