DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs

Xiaozhe Yao, Qinghao Hu, Ana Klimovic

TL;DR

DeltaZip tackles the challenge of serving many fine-tuned LLM variants by exploiting the small perturbations that full-model fine-tuning introduces relative to the pre-trained weights. It introduces a hardware-friendly delta compression pipeline, $\Delta$Compress, and a co-designed serving stack that decouples base and delta inference and accelerates low-precision sparse computations with the SBMM kernel. The system achieves up to $10\times$ compression and $2\times$ to $12\times$ throughput improvements while maintaining accuracy comparable to FP16 full-model-tuned (FMT) models, outperforming direct full-model compression baselines and PEFT-based approaches on many tasks. This lowers the cost and latency of multi-variant LLM serving, making it practical to host large numbers of specialized models in real-world deployments. Open-source implementations are provided to facilitate adoption and further research.

Abstract

Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different LLMs. To bridge this gap, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10x while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves 2x to 12x improvement in throughput compared to the state-of-the-art systems.
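The idea of decoupling base and delta inference can be illustrated with a small sketch. It assumes each fine-tuned weight matrix is stored as `W_base + delta`, so a request for any variant computes `x @ W_base` once against the shared base weights and adds a per-variant `x @ delta` term. The names `serve_batch`, `w_base`, and `deltas` are hypothetical, the deltas are kept dense and uncompressed for readability, and this is not the SBMM kernel or the authors' implementation.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain dense matmul: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def serve_batch(
    requests: List[Tuple[str, List[float]]],
    w_base: Matrix,
    deltas: Dict[str, Matrix],
) -> Matrix:
    """Compute x @ (W_base + delta_variant) for each (variant, x) request."""
    # Step 1: a single batched matmul against the shared base weights,
    # regardless of which variant each request targets.
    inputs = [x for _, x in requests]
    base_out = matmul(inputs, w_base)

    # Step 2: a per-request matmul against that variant's delta,
    # added on top of the shared base result.
    outputs = []
    for (variant, x), y_base in zip(requests, base_out):
        y_delta = matmul([x], deltas[variant])[0]
        outputs.append([b + d for b, d in zip(y_base, y_delta)])
    return outputs


# Toy example: two hypothetical variants sharing one 2x2 base matrix.
w_base = [[1.0, 0.0], [0.0, 1.0]]
deltas = {
    "math": [[0.5, 0.0], [0.0, 0.0]],
    "code": [[0.0, 0.0], [0.0, -0.25]],
}
requests = [("math", [2.0, 4.0]), ("code", [2.0, 4.0])]
outputs = serve_batch(requests, w_base, deltas)
```

Because the base matmul is shared, requests for different variants can be batched together, and only the small delta term differs per variant.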

Paper Structure

This paper contains 35 sections, 2 equations, 19 figures, 2 tables, and 1 algorithm.

Figures (19)

  • Figure 1: Invocation counts per 5-minute time window for 20 different models in the LMSys Chatbot Arena trace [zheng_judging_2023].
  • Figure 2: LoRA vs. full-model fine-tuning accuracy [anyscale_fine-tuning_2023, lora-learns-forgets-biderman2024lora]. LoRA fine-tuning is comparable for some tasks (SQL), but has lower quality on more complex tasks (Math and Code).
  • Figure 3: Flattened weight matrix in an intermediate layer of the pre-trained model (a), the fine-tuned model (b), and the model delta between the two (bottom, (b)-(a)).
  • Figure 4: DeltaZip system architecture.
  • Figure 5: DeltaZip Compression Pipeline. The pipeline consists of delta extraction, sparsification & quantization, and optionally lossless compression. The compressed delta is stored as a dictionary of compressed weight matrices and metadata.
  • ...and 14 more figures
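
The stages named in the Figure 5 caption (delta extraction, sparsification and quantization, optional lossless compression, and storage as a dictionary of compressed matrices plus metadata) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: magnitude-based pruning, symmetric round-to-nearest quantization, and `zlib` are simple stand-ins rather than the algorithms the paper uses, weights are flattened to Python lists, and all function and key names are hypothetical.

```python
import json
import zlib
from typing import Dict, List, Tuple


def extract_delta(finetuned: List[float], base: List[float]) -> List[float]:
    """Delta extraction: element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def sparsify(delta: List[float], keep_ratio: float) -> List[float]:
    """Keep the largest-magnitude entries and zero the rest (ties may keep extra)."""
    k = max(1, int(len(delta) * keep_ratio))
    threshold = sorted((abs(v) for v in delta), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in delta]


def quantize(delta: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization to signed integer codes with one scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in delta)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) for v in delta], scale


def compress_delta(finetuned, base, keep_ratio=0.5, bits=4, lossless=True) -> Dict:
    """Run the full pipeline for one flattened weight matrix."""
    delta = sparsify(extract_delta(finetuned, base), keep_ratio)
    codes, scale = quantize(delta, bits)
    payload = json.dumps(codes).encode()
    if lossless:  # optional lossless stage
        payload = zlib.compress(payload)
    return {"codes": payload, "scale": scale, "bits": bits, "lossless": lossless}


def decompress_delta(entry: Dict) -> List[float]:
    """Invert the lossless stage and dequantize back to an approximate delta."""
    raw = zlib.decompress(entry["codes"]) if entry["lossless"] else entry["codes"]
    return [c * entry["scale"] for c in json.loads(raw)]


# Toy example: one "layer" with four weights.
base = [1.0, 2.0, 3.0, 4.0]
finetuned = [1.75, 2.0, 2.75, 4.125]
compressed = {"layer0.weight": compress_delta(finetuned, base)}
restored = decompress_delta(compressed["layer0.weight"])
```

At serving time, a restored delta would be combined with the shared base weights; the reconstruction is lossy because of the pruning and quantization steps.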