BitDelta: Your Fine-Tune May Only Be Worth One Bit
James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai
TL;DR
BitDelta proposes a simple yet effective post-training quantization of the fine-tuning delta between a base LLM and its fine-tuned version, compressing the delta to 1 bit with a trainable scaling factor and a subsequent scale distillation step. By representing multiple fine-tuned models as a single high-precision base plus numerous 1-bit deltas, it dramatically reduces GPU memory and can translate to latency gains in multi-tenant serving without large drops in accuracy. The method is validated across Llama-2, Mistral, and MPT families up to 70B parameters, outperforming low-rank delta approximations and remaining effective even when the base model is quantized. Scale distillation is key to recovering performance, enabling BitDelta to preserve fine-tune information across diverse tasks such as TruthfulQA, GSM8K, and MT-Bench. Overall, BitDelta enables efficient, scalable deployment of many fine-tuned models on shared infrastructure, with practical implications for storage, loading times, and generation latency in real-world systems.
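The compression step described above (a 1-bit delta plus one scaling factor per weight matrix) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `binarize_delta` and `reconstruct` are hypothetical, the scale is initialized to the mean absolute value of the delta (an assumption here, chosen because it minimizes the L2 error for a sign matrix), and the subsequent scale distillation step is omitted entirely.

```python
import numpy as np

def binarize_delta(w_base, w_fine):
    """Compress the fine-tuning delta to one sign bit per weight plus one scale.

    Hypothetical helper: the mean-absolute-value scale is an assumed
    initialization; the scale distillation step is not shown.
    """
    delta = w_fine - w_base
    scale = np.abs(delta).mean()   # single scalar for the whole matrix
    sign_bits = delta >= 0         # True encodes +1, False encodes -1
    return sign_bits, scale

def reconstruct(w_base, sign_bits, scale):
    """Approximate the fine-tuned weights as base + scale * sign(delta)."""
    signs = np.where(sign_bits, 1.0, -1.0)
    return w_base + scale * signs

# Toy example with synthetic weights (not real model weights).
rng = np.random.default_rng(0)
w_base = rng.normal(size=(4, 8)).astype(np.float32)
w_fine = w_base + 0.01 * rng.normal(size=(4, 8)).astype(np.float32)

sign_bits, scale = binarize_delta(w_base, w_fine)
w_hat = reconstruct(w_base, sign_bits, scale)

# Pack the 32 sign bits into 4 bytes for storage.
packed = np.packbits(sign_bits)
```

With this choice of scale, the reconstruction is always at least as close to the fine-tuned weights (in L2 norm) as the base weights alone, which is why it is a reasonable starting point before any further calibration of the scale.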
Abstract
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta reduces GPU memory requirements by more than 10x, which can also translate into lower generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.
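To make the memory argument concrete, the following back-of-the-envelope sketch compares serving many fine-tuned models as full 16-bit copies against serving one 16-bit base plus one 1-bit delta per model. All numbers are hypothetical (a 7B-parameter model, 32 tenants), and the function `serving_memory_gb` is an illustrative helper, not part of BitDelta. It counts weight storage only, ignoring per-matrix scales, any layers left uncompressed, activations, and the KV cache.

```python
def serving_memory_gb(n_params, n_models, base_bits=16, delta_bits=1):
    """Return (naive_gb, shared_gb) of weight memory for n_models fine-tunes.

    naive:  every fine-tuned model stored as a full copy at base_bits.
    shared: one base model at base_bits plus one delta per model at delta_bits.
    Assumes every parameter is compressed; real deployments may differ.
    """
    bytes_per_gb = 1e9
    naive = n_models * n_params * base_bits / 8 / bytes_per_gb
    base = n_params * base_bits / 8 / bytes_per_gb
    deltas = n_models * n_params * delta_bits / 8 / bytes_per_gb
    return naive, base + deltas

# Hypothetical setting: 7e9 parameters, 32 fine-tuned models.
naive, shared = serving_memory_gb(7e9, 32)
ratio = naive / shared
```

Under these assumptions each additional model costs `delta_bits / base_bits` of a full copy, so the aggregate saving grows with the number of models and approaches 16x in the limit for a 16-bit base.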
