Enhancing Delta Compression in LLMs via SVD-based Quantization Error Minimization
Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen
TL;DR
PrinMix is a principled SVD-based delta compression framework that minimizes quantization error under a bit budget. It formulates bit allocation as a 0/1 Integer Linear Programming (ILP) problem and introduces Reconstruction Target Correction (RTC) to mitigate the bias introduced by sequential quantization. The method shows theoretically that mixed-precision quantization is necessary for the $\mathbf{V}$ factor of the SVD, allocates bits to singular vectors by solving the ILP, and applies RTC to refine the quantization of $\mathbf{U}$. Across reasoning, math, code, and multimodal tasks, PrinMix outperforms state-of-the-art delta-compression baselines (notably Delta-CoMe) on 7B and 13–14B models, and it yields substantial memory savings and more efficient deployment in multi-tenant serving. Its performance is robust to the choice of calibration data and compression ratio, with RTC contributing notable improvements on challenging tasks.
Abstract
Supervised Fine-Tuning (SFT) empowers Large Language Models (LLMs) with exceptional performance on specialized tasks, but it yields dense, high-dimensional delta parameters that pose severe storage and distribution challenges. Singular Value Decomposition (SVD)-based compression offers a compact representation for such delta parameters, but existing methods adopt heuristic quantization schemes without clarifying the underlying mechanisms, which leads to poor generalizability. In this work, we propose PrinMix, a rigorous SVD-based framework that models quantization as an optimization problem, so that each design choice follows from an explicit error analysis. We first derive the quantization error theoretically and identify a key singular-value-dominated scaling mechanism, which mathematically proves the necessity of mixed-precision quantization. We then model the quantization scheme as a 0/1 Integer Linear Programming (ILP) problem, which yields optimal solutions under a bit-budget constraint without empirical assumptions. Furthermore, PrinMix integrates a Reconstruction Target Correction (RTC) method to compensate for errors from the $\mathbf{V}$-then-$\mathbf{U}$ sequential quantization process. Extensive experiments confirm the effectiveness of PrinMix: for 7B LLMs, it outperforms the state-of-the-art Delta-CoMe on challenging benchmarks by 22.3% on AIME2024 and 6.1% on GQA.
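
To make the bit-budget-constrained 0/1 formulation in the abstract concrete, the following is a minimal toy sketch, not the PrinMix implementation. It assumes a hypothetical proxy cost in which the error of a singular vector quantized to $b$ bits scales as $\sigma_i^2 \cdot 4^{-b}$; the actual error expression is derived in the paper and may differ. The function name `allocate_bits`, the candidate bit-widths, and the budget are all illustrative assumptions, and the tiny problem is solved by exhaustive search rather than by an ILP solver.

```python
from itertools import product


def allocate_bits(singular_values, bit_choices, budget):
    """Choose one bit-width per singular vector under a total-bit budget.

    This mirrors a 0/1 ILP with indicator variables x[i][b] in {0, 1},
    exactly one chosen bit-width per index i, and
    sum_i (chosen bits for i) <= budget. It is solved here by brute force,
    which is only feasible for very small instances.
    """
    best_cost, best_assignment = float("inf"), None
    for assignment in product(bit_choices, repeat=len(singular_values)):
        if sum(assignment) > budget:
            continue  # violates the bit budget
        # ASSUMPTION: proxy error sigma^2 * 4^(-b); not the paper's derived formula.
        cost = sum(s * s * 4.0 ** (-b) for s, b in zip(singular_values, assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost


# Illustrative values only: four singular values, four candidate bit-widths.
sigmas = [10.0, 4.0, 1.0, 0.5]
bits, cost = allocate_bits(sigmas, bit_choices=(0, 2, 3, 8), budget=13)
print(bits, cost)
```

Under this proxy, the search assigns more bits to the vectors paired with larger singular values, which is the qualitative behavior that motivates mixed precision. A realistic instance would replace the exhaustive search with an ILP solver and the proxy with the derived error terms.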
