Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Bowen Ping, Shuo Wang, Hanqing Wang, Xu Han, Yuzhuang Xu, Yukun Yan, Yun Chen, Baobao Chang, Zhiyuan Liu, Maosong Sun

TL;DR

This work proposes a mixed-precision delta quantization approach that assigns higher-bit representations to the singular vectors corresponding to larger singular values of the delta weights, motivated by the long-tail distribution of those singular values.

Abstract

Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a mixed-precision delta quantization approach. This method employs higher-bit representations for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs. Experimental results demonstrate that our approach performs comparably to fully fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability.
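
The idea in the abstract can be illustrated with a minimal NumPy sketch: take the delta between fine-tuned and base weights, decompose it with SVD, and quantize groups of singular vectors at bit-widths that decrease with the singular value. This is a sketch under stated assumptions, not the authors' implementation. The function names `quantize_uniform` and `compress_delta`, the group sizes, and the simple min-max round-to-nearest quantizer are all hypothetical choices made for illustration; the paper may use a different quantizer and bit allocation.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Quantize-dequantize x on a uniform min-max grid (illustrative quantizer).

    bits == 0 drops the block entirely; bits >= 16 is treated as lossless here.
    """
    if bits == 0:
        return np.zeros_like(x)
    if bits >= 16:
        return x.copy()
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / scale) * scale + lo


def compress_delta(w_finetuned, w_base, groups):
    """Return a mixed-precision approximation of the delta weights.

    groups: list of (num_singular_vectors, bits), ordered from the largest
    singular values to the smallest. Ranks not covered by any group are
    omitted, i.e. treated as 0-bit.
    """
    delta = w_finetuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    u_q = np.zeros_like(u)
    vt_q = np.zeros_like(vt)
    start = 0
    for count, bits in groups:
        end = min(start + count, s.size)
        if start >= end:
            break
        # Singular vectors of larger singular values get more bits.
        u_q[:, start:end] = quantize_uniform(u[:, start:end], bits)
        vt_q[start:end, :] = quantize_uniform(vt[start:end, :], bits)
        start = end
    # Singular values are kept in full precision in this sketch.
    return (u_q * s) @ vt_q


rng = np.random.default_rng(0)
w_base = rng.standard_normal((32, 32))
w_finetuned = w_base + 0.01 * rng.standard_normal((32, 32))
# Hypothetical allocation: 2 vectors at 8 bits, 6 at 3 bits, 8 at 2 bits, rest dropped.
delta_hat = compress_delta(w_finetuned, w_base, [(2, 8), (6, 3), (8, 2)])
w_restored = w_base + delta_hat
```

At serving time the base weights would be shared across tenants and only the compressed delta stored per fine-tuned model, which is where the cost reduction comes from.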

Paper Structure (30 sections, 8 equations, 4 figures, 11 tables)


Figures (4)

  • Figure 1: Left: illustration of BitDelta (Liu et al., 2024), which employs 1-bit quantization for all the delta weights. Middle: illustration of low-rank compression (Ryu et al., 2023), retaining the top-$k$ singular values and the corresponding singular vectors. Right: illustration of the proposed Delta-CoMe method, which represents the singular vectors of larger singular values using high-bit vectors while compressing the singular vectors of smaller singular values into low-bit representations. This method is inspired by the long-tail distribution of singular values in delta weights.
  • Figure 2: Illustration of Delta-CoMe, where we utilize varying bit-widths for singular vectors with different singular values. Singular vectors corresponding to larger singular values are assigned higher bit-widths. For extremely small singular values, we omit the singular vectors (i.e., 0-bit).
  • Figure 3: Inference time of the PyTorch and Triton implementation of Delta-CoMe.
  • Figure 4: Case study for different delta-compression methods, where only the triple-precision compression method proposed in this work can give the correct answer.
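
For contrast with the mixed-precision scheme, the two baselines described in the Figure 1 caption can be sketched in a few lines of NumPy. This is an illustrative approximation only: the function names `one_bit_delta` and `low_rank_delta` are hypothetical, and the per-matrix scale in the 1-bit variant is assumed here to be the mean absolute value of the delta; the original methods may calibrate or refine these quantities differently.

```python
import numpy as np


def one_bit_delta(delta):
    """1-bit baseline: keep only the sign of each entry plus one shared scale.

    The scale is assumed to be the mean absolute value of the delta.
    """
    scale = np.abs(delta).mean()
    return scale * np.where(delta >= 0, 1.0, -1.0)


def low_rank_delta(delta, k):
    """Low-rank baseline: keep the top-k singular values and their vectors."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]


rng = np.random.default_rng(0)
delta = 0.01 * rng.standard_normal((16, 16))
delta_1bit = one_bit_delta(delta)
delta_rank4 = low_rank_delta(delta, 4)
```

Both baselines treat every retained component uniformly: the 1-bit variant spends the same budget on all entries, and the low-rank variant keeps the top-$k$ components at full precision while discarding the rest. The mixed-precision approach sits between the two by varying the bit-width with the singular value.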