
BitDelta: Your Fine-Tune May Only Be Worth One Bit

James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai

TL;DR

BitDelta proposes a simple yet effective post-training quantization of the fine-tuning delta between a base LLM and its fine-tuned version, compressing the delta to 1 bit with a trainable scaling factor and a subsequent scale distillation step. By representing multiple fine-tuned models as a single high-precision base plus numerous 1-bit deltas, it dramatically reduces GPU memory and can translate to latency gains in multi-tenant serving without large drops in accuracy. The method is validated across Llama-2, Mistral, and MPT families up to 70B parameters, outperforming low-rank delta approximations and remaining effective even when the base model is quantized. Scale distillation is key to recovering performance, enabling BitDelta to preserve fine-tune information across diverse tasks such as TruthfulQA, GSM8K, and MT-Bench. Overall, BitDelta enables efficient, scalable deployment of many fine-tuned models on shared infrastructure, with practical implications for storage, loading times, and generation latency in real-world systems.

Abstract

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.
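
As a rough illustration of the memory argument above, the following Python sketch compares storing several full fine-tuned models against storing one shared base plus 1-bit deltas. The function names, the 7B parameter count, and the 16-bit base precision are assumptions made for illustration. The accounting ignores per-matrix scale factors, any layers left unquantized, activations, and the KV cache, so it is not meant to reproduce the paper's measured numbers.

```python
def naive_bytes(num_params, num_models, bits_per_weight=16):
    """Bytes needed to store every fine-tuned model separately (assumed 16-bit weights)."""
    return num_models * num_params * bits_per_weight // 8


def bitdelta_bytes(num_params, num_models, base_bits=16, delta_bits=1):
    """Bytes for one shared high-precision base plus one 1-bit delta per model."""
    base = num_params * base_bits // 8
    deltas = num_models * num_params * delta_bits // 8
    return base + deltas


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for num_models in (1, 4, 16, 64):
        naive = naive_bytes(params, num_models)
        shared = bitdelta_bytes(params, num_models)
        print(f"{num_models:>3} models: naive {naive / 1e9:.1f} GB, "
              f"base + 1-bit deltas {shared / 1e9:.1f} GB, "
              f"ratio {naive / shared:.2f}x")
```

Under these idealized assumptions each delta is 16x smaller than a 16-bit copy of the weights. The overall ratio is lower for small model counts, where the shared base dominates, and approaches 16x as the count grows.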

Paper Structure

This paper contains 31 sections, 6 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Overview of BitDelta. BitDelta applies 1-bit quantization to the weight delta between fine-tuned and base models. For each weight matrix, we quantize its delta as its sign bits and a trainable high-precision scale factor. The scale factor is initialized to achieve the best approximation error in $L_2$ norm and further refined with a few distillation steps. BitDelta shows minimal degradation in model performance and reduces memory consumption in multi-tenancy serving by representing multiple fine-tuned models with a single high-precision base model and multiple 1-bit deltas.
  • Figure 2: Cumulative Explained Variance (CEV) plot of a $4096 \times 4096$ weight delta between Llama 2-7B and Vicuna-7B v1.5. Deltas from full parameter fine-tuning are fairly high rank, making low-rank approximations difficult.
  • Figure 3: As the fidelity of $\Delta$ increases, the TruthfulQA scores of Llama 2-7B + $\Delta$ approach those of Vicuna-7B v1.5.
  • Figure 4: Decoding latency of a linear layer, as in Eqn. \ref{eqn:kernel_decomp}. Black: Shared base weight backbone $W_\text{base}X$. Blue: Batched activation-product with $B$ 1-bit deltas, as in BitDelta. Red: Batched activation-product with $B$ low-rank deltas, as in S-LoRA. Left: Ablation over hidden size, assuming $N=M$ and $B=1$. Right: Ablation over batch size, assuming $N=M=4096$.
  • Figure 5: Memory usage of Llama 2-7B, assuming each sequence in the batch has a length of $128$. Blue: Memory usage of the naive method, separately storing $B$ distinct fine-tuned models. Orange: Projected values for the naive method. Green: Memory usage of BitDelta. The naive forward pass succumbs to GPU memory issues at higher batch sizes.
  • ...and 1 more figure
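
The quantization step described in the Figure 1 caption can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the sign bits are held as `int8` rather than packed, and the later distillation-based refinement of the scale is omitted. The scale is set to the mean absolute value of the delta, which is the value of $\alpha$ that minimizes $\lVert \Delta - \alpha \cdot \mathrm{sign}(\Delta) \rVert_2^2$ for a fixed sign matrix.

```python
import numpy as np


def bitdelta_quantize(w_base, w_fine):
    """Return (signs, scale) approximating w_fine - w_base with 1 bit per entry."""
    delta = w_fine - w_base
    # Map strictly positive entries to +1 and everything else to -1,
    # so every entry needs exactly one bit.
    signs = np.where(delta > 0, 1, -1).astype(np.int8)
    # L2-optimal scale for a fixed sign matrix: mean absolute value of the delta.
    scale = float(np.mean(np.abs(delta)))
    return signs, scale


def bitdelta_reconstruct(w_base, signs, scale):
    """Approximate the fine-tuned weights from the base and the 1-bit delta."""
    return w_base + scale * signs.astype(w_base.dtype)


# Toy example with synthetic weights (random stand-ins, not real model weights).
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 64)).astype(np.float32)
w_fine = w_base + 0.01 * rng.standard_normal((64, 64)).astype(np.float32)

signs, scale = bitdelta_quantize(w_base, w_fine)
w_hat = bitdelta_reconstruct(w_base, signs, scale)
print("scale:", scale)
print("relative error of the delta:",
      np.linalg.norm(w_fine - w_hat) / np.linalg.norm(w_fine - w_base))
```

In a real deployment the sign matrix would be bit-packed (for example with `np.packbits` after mapping -1 to 0) so that each entry occupies one bit of storage, and the scale would then be treated as a trainable parameter during distillation.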