One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

Ke Yi, Yuhui Xu, Heng Chang, Chen Tang, Yuan Meng, Tong Zhang, Jia Li

TL;DR

The paper tackles the challenge of deploying LLMs under diverse resource constraints without repeating costly retraining. It introduces LLM-QFA, a one-shot quantization-aware training framework that builds a layer-wise mixed-precision supernet and decouples configuration-specific weights using Low-Rank adapters, augmented by a non-parametric, resource-balanced scheduler. A search component identifies optimal subnets under given budgets without extra retraining, yielding multiple high-performing 2/3/4-bit configurations for LLaMA2-7b/13b. Empirical results on MMLU and Common Sense QA demonstrate maintained accuracy with significantly reduced deployment time, suggesting scalable applicability to larger models and real-world multi-scenario deployments.

Abstract

Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from quantization loss. However, deploying LLMs across diverse scenarios with different resource constraints, e.g., servers and personal computers, requires repeated training per application, which amplifies the lengthy training problem. Given that, it is advantageous to train a once-for-all (OFA) supernet capable of yielding diverse optimal subnets for downstream applications through one-shot training. Nonetheless, the scale of current language models impedes efficiency and amplifies interference from weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple shared weights to eliminate the interference and incorporate Low-Rank adapters for training efficiency. Furthermore, we observe that traditional uniform sampling allocates training resources unevenly across subnets. A non-parametric scheduler is introduced to adjust the sampling rate for each quantization configuration, achieving a more balanced allocation among subnets with varying demands. We validate the approach on the LLaMA2 family, and downstream evaluation confirms our ability to maintain high performance while significantly reducing deployment time when multiple scenarios must be served.
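
The idea of decoupling shared weights with Low-Rank adapters can be illustrated with a small sketch. It is not the authors' implementation: the class name `DecoupledLayer`, the min-max round-to-nearest quantizer, the rank, and the initialization are all assumptions made for illustration. The only point it shows is that each bit-width option of a layer keeps its own quantized base weights and its own adapter pair, so updating one option cannot interfere with another.

```python
import random

BIT_CHOICES = (2, 3, 4)  # bit-widths named in the TL;DR


def quantize(matrix, bits):
    """Min-max round-to-nearest quantization of a matrix to at most 2**bits levels.

    Illustrative quantizer only; the paper's quantizer may differ.
    """
    flat = [w for row in matrix for w in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid a zero scale for constant matrices
    return [[lo + round((w - lo) / scale) * scale for w in row] for row in matrix]


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class DecoupledLayer:
    """Hypothetical layer holding one frozen quantized base and one LoRA pair per bit-width."""

    def __init__(self, weight, rank, rng):
        out_dim, in_dim = len(weight), len(weight[0])
        # Frozen quantized copies of the same full-precision weight.
        self.base = {b: quantize(weight, b) for b in BIT_CHOICES}
        # Separate trainable adapters per bit-width: A is small random, B starts at zero.
        self.lora_a = {
            b: [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
            for b in BIT_CHOICES
        }
        self.lora_b = {b: [[0.0] * rank for _ in range(out_dim)] for b in BIT_CHOICES}

    def forward(self, x, bits):
        """Compute W_q(bits) @ x + B(bits) @ (A(bits) @ x) for the selected bit-width."""
        base_out = matvec(self.base[bits], x)
        low_rank = matvec(self.lora_b[bits], matvec(self.lora_a[bits], x))
        return [u + v for u, v in zip(base_out, low_rank)]
```

A subnet in this picture is simply a choice of `bits` for every layer. Because the adapters are keyed by bit-width, a gradient step taken while a layer runs at 4 bits leaves the 2-bit and 3-bit adapters of that layer untouched.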

Paper Structure (21 sections, 6 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: (a) Compressing Large Language Models (LLMs) for deployment across various platforms while ensuring performance is a challenging task. Applying Quantization-Aware Training (QAT) for each platform is both time-consuming and costly. (b) Instead, our objective is to one-shot fine-tune one quantized LLM that can be efficiently specialized for multiple platforms. The one-shot fine-tuning process significantly reduces the investment. (c) The LLM-QFA framework excels in swiftly delivering optimal networks under different resource constraints in one shot, whereas the traditional method requires repeated fine-tuning.
  • Figure 2: An illustration of the goal of LLM-QFA. Compared with traditional OFA with Quantization-Aware Training, our approach circumvents interference issues by decoupling shared weights and incorporating Low-Rank Adapters to further enhance training efficiency. More notably, we employ a resource-balance sampling strategy to expedite the convergence of subnets across resource constraints.
  • Figure 3: (a) Distribution of the average bit-width of samples obtained from uniform sampling, which approximates a low-variance Gaussian distribution. (b) A mixture of Gaussian distributions can approximate a uniform distribution. (c) Showcase of our Resource-Balance sampling strategy.
  • Figure 4: Left: The time required to obtain N specialized networks varies across methods. Our proposed QFA approach significantly reduces the time cost compared to the QA-LoRA method and achieves an efficiency level comparable to the pure quantization technique, GPTQ. Right: For each method, we obtain three specialized networks under (2, 3, 4) bit constraints on the LLaMA2-7b and LLaMA2-13b models. The average accuracy on the 5-shot MMLU benchmark for networks quantized at (2, 3, 4) bits is reported. Although GPTQ can achieve a lower time cost, it is accompanied by an unacceptable level of performance degradation. Full results are provided in the paper's MMLU results table.
  • Figure 5: LLM-QFA can deliver multiple optimal subnets under different constraints. Left: Comparison on the ARC-C dataset; Right: Comparison on the remaining Common Sense QA tasks.
  • ...and 3 more figures
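
The observation in Figure 3 (a) and (b) can be reproduced with a short simulation. This is a minimal sketch under stated assumptions, not the paper's scheduler: the layer count of 32, the shifted probability profiles in `SHIFTED_PROFILES`, and the round-robin rotation are all hypothetical. It shows that drawing each layer's bit-width uniformly from {2, 3, 4} concentrates the average bit-width tightly around 3, while rotating through several shifted per-layer distributions yields a mixture whose average bit-width covers the 2 to 4 range far more evenly.

```python
import random
import statistics

BIT_CHOICES = (2, 3, 4)  # bit-widths named in the TL;DR
NUM_LAYERS = 32          # assumption: illustrative number of quantized layers

UNIFORM = (1 / 3, 1 / 3, 1 / 3)

# Hypothetical per-layer probability profiles over (2, 3, 4) bits.
# Each profile produces a narrow, roughly Gaussian average bit-width
# centered at a different value; rotating through them gives a mixture.
SHIFTED_PROFILES = [
    (0.8, 0.1, 0.1),
    (0.6, 0.2, 0.2),
    UNIFORM,
    (0.2, 0.2, 0.6),
    (0.1, 0.1, 0.8),
]


def sample_config(rng, weights, num_layers=NUM_LAYERS):
    """Draw one layer-wise bit-width configuration with the given per-layer probabilities."""
    return rng.choices(BIT_CHOICES, weights=weights, k=num_layers)


def average_bits(config):
    """Average bit-width of a configuration, used here as a proxy for its resource cost."""
    return sum(config) / len(config)


def uniform_averages(num_samples, seed=0):
    """Average bit-widths of configurations drawn with plain uniform sampling."""
    rng = random.Random(seed)
    return [average_bits(sample_config(rng, UNIFORM)) for _ in range(num_samples)]


def mixture_averages(num_samples, seed=0):
    """Average bit-widths when the sampler rotates through the shifted profiles."""
    rng = random.Random(seed)
    return [
        average_bits(sample_config(rng, SHIFTED_PROFILES[i % len(SHIFTED_PROFILES)]))
        for i in range(num_samples)
    ]


if __name__ == "__main__":
    for name, values in (("uniform", uniform_averages(5000)),
                         ("mixture", mixture_averages(5000))):
        print(name, "mean:", round(statistics.mean(values), 3),
              "stdev:", round(statistics.stdev(values), 3))
```

Under uniform sampling, subnets near the 2-bit and 4-bit ends of the budget range are almost never drawn, which is the imbalance the abstract describes. The rotation above is only one simple way to widen the coverage.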