QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

Zhikai Li; Xiaoxuan Liu; Banghua Zhu; Zhen Dong; Qingyi Gu; Kurt Keutzer

TL;DR

QFT addresses the high memory demands of full-parameter fine-tuning for large language models by quantizing all training states (weights, gradients, and optimizer states) to INT8, enabling end-to-end training on affordable hardware. It combines the Lion optimizer, which the authors prove is robust to quantization of gradients and momentum, with a hybrid feature quantizer that protects a small set of sparse critical features in the weights, plus a stack-based gradient flow with O(1) access for integer backpropagation. Empirical results on LLaMA-2 models show model-state memory reduced to roughly 21% of standard FP32 training, with performance comparable to full-precision fine-tuning, albeit with a modest time overhead. The work suggests that full-parameter fine-tuning can be practical on commodity GPUs, potentially broadening access to LLM adaptation in resource-constrained environments.
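The memory figure can be sanity-checked with a back-of-envelope estimate. The sketch below is an illustration and is not taken from the paper: it assumes FP32 Adam stores four FP32 buffers per parameter (weights, gradients, two moments) and that an INT8 Lion setup stores three INT8 buffers (weights, gradients, momentum). Scale factors and the outliers kept at higher precision are ignored, and the names `model_state_gb`, `FP32_ADAM_BYTES`, and `INT8_LION_BYTES` are hypothetical.

```python
# Back-of-envelope model-state memory, in bytes per parameter.
# Assumption: FP32 Adam keeps weights, gradients, and two moment buffers in FP32;
# an INT8 Lion setup keeps weights, gradients, and one momentum buffer in INT8.
# Quantization scales and outliers preserved at higher precision are ignored.

def model_state_gb(num_params: float, bytes_per_param: float) -> float:
    """Return model-state memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

FP32_ADAM_BYTES = 4 * 4  # weights + gradients + first moment + second moment
INT8_LION_BYTES = 1 * 3  # weights + gradients + momentum

num_params = 7e9  # a 7B-parameter model
fp32_gb = model_state_gb(num_params, FP32_ADAM_BYTES)
int8_gb = model_state_gb(num_params, INT8_LION_BYTES)
ratio = INT8_LION_BYTES / FP32_ADAM_BYTES

print(f"FP32 Adam: {fp32_gb:.0f} GB, INT8 Lion: {int8_gb:.0f} GB, ratio: {ratio:.2%}")
```

Under these assumptions the ratio comes out at 18.75%, in the same range as the reported 21%; the gap would plausibly be covered by the overhead this estimate leaves out.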

Abstract

Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however, this process typically requires a large number of expensive, high-end GPUs. Although there have been efforts focused on parameter-efficient fine-tuning, they cannot fully unlock the powerful potential of full-parameter fine-tuning. In this paper, we propose QFT, a Quantized Full-parameter Tuning framework for LLMs that quantizes and stores all training states, including weights, gradients, and optimizer states, in INT8 format to reduce training memory, thereby enabling full-parameter fine-tuning on existing GPUs at an affordable cost. To ensure training performance, we make two key efforts: i) for quantized gradients and optimizer states, we theoretically prove that the Lion optimizer, with its property of consistent update magnitudes, is highly robust to quantization; and ii) for quantized weights, we employ the hybrid feature quantizer, which identifies and protects a small subset of sparse critical features while quantizing the remaining dense features, thus ensuring accurate weight updates without FP32 backups. Moreover, to support backpropagation in the integer context, we develop a stack-based gradient flow scheme with O(1) complexity, forming a unified integer training pipeline. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance; for example, tuning a LLaMA-7B model requires less than 30GB of memory, making it feasible on a single A6000 GPU.
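The idea behind the hybrid feature quantizer can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the "sparse critical features" are simply the largest-magnitude entries, selected by a hypothetical `outlier_fraction` parameter, and it uses a single symmetric per-tensor scale for the dense remainder. The function names `hybrid_quantize` and `hybrid_dequantize` are likewise hypothetical.

```python
from typing import Dict, List, Tuple

def hybrid_quantize(
    weights: List[float], outlier_fraction: float
) -> Tuple[List[int], float, Dict[int, float]]:
    """Keep the largest-magnitude entries in float; quantize the rest to INT8.

    Assumption: outliers are chosen purely by magnitude (illustrative only).
    """
    n_outliers = int(len(weights) * outlier_fraction)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outliers = {i: weights[i] for i in by_magnitude[:n_outliers]}

    # The scale is set by the dense values only, so outliers do not stretch the range.
    dense_max = max((abs(w) for i, w in enumerate(weights) if i not in outliers), default=0.0)
    scale = dense_max / 127 if dense_max > 0 else 1.0

    q = [
        0 if i in outliers else max(-127, min(127, round(w / scale)))
        for i, w in enumerate(weights)
    ]
    return q, scale, outliers

def hybrid_dequantize(q: List[int], scale: float, outliers: Dict[int, float]) -> List[float]:
    """Rebuild float values: outliers are restored exactly, dense values via the scale."""
    return [outliers[i] if i in outliers else v * scale for i, v in enumerate(q)]

def max_abs_error(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))

# One extreme value (5.0) among small dense values.
weights = [0.01, -0.02, 0.015, 5.0, -0.005, 0.02, -0.01, 0.0]

q, scale, outliers = hybrid_quantize(weights, outlier_fraction=0.125)
hybrid_error = max_abs_error(weights, hybrid_dequantize(q, scale, outliers))

q0, scale0, outliers0 = hybrid_quantize(weights, outlier_fraction=0.0)
naive_error = max_abs_error(weights, hybrid_dequantize(q0, scale0, outliers0))

print(f"hybrid max error: {hybrid_error:.2e}, naive max error: {naive_error:.2e}")
```

With the outlier protected, the quantization step is set by the dense values, so the small weights keep their resolution; without protection, the single extreme value dictates the step and most dense values collapse toward zero.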

Paper Structure (15 sections, 1 theorem, 9 equations, 7 figures, 5 tables, 3 algorithms)

Introduction
Related Work
Methodology
Lion Optimizer: Robust Quantization of Gradients and Momentum
Hybrid Feature Quantizer: Accurate Updates of Quantized Weights
The Integer Training Pipeline
Experiments
Experimental Setup
Memory Profile
Performance Evaluation
Conclusion
The Standard Lion Procedure
Analysis of Values of $\frac{|\Delta|}{\sigma_\delta}$
Discussion on Outlier Thresholds of Weight Quantizer
Qualitative Analysis of Conversational Abilities

Key Result

Lemma 1

Under the paper's stated assumption, when quantizing gradients and momentum in Lion, if the increment $\Delta$ satisfies $|\Delta|\ge 1.645\sqrt{\beta_1^2\sigma_m^2+(1-\beta_1)^2\sigma_g^2}$, then with 95% probability, $\mathrm{sign}(\Delta)$ remains invariant under quantization.
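The bound in Lemma 1 can be checked numerically. The sketch below rests on an assumption that the source does not spell out here: that the quantization error on $\Delta$ is zero-mean Gaussian with standard deviation $\sigma_\delta=\sqrt{\beta_1^2\sigma_m^2+(1-\beta_1)^2\sigma_g^2}$, as would follow from independent errors on momentum and gradient. The helper names and the sample values of `beta1`, `sigma_m`, and `sigma_g` are hypothetical.

```python
import math

def sign_threshold(beta1: float, sigma_m: float, sigma_g: float, z: float = 1.645):
    """Return (threshold on |Delta|, sigma_delta) for the bound in Lemma 1."""
    sigma_delta = math.sqrt(beta1**2 * sigma_m**2 + (1 - beta1) ** 2 * sigma_g**2)
    return z * sigma_delta, sigma_delta

def sign_preserved_probability(delta: float, sigma_delta: float) -> float:
    """P(sign(delta + noise) == sign(delta)) for noise ~ N(0, sigma_delta^2).

    Assumption: Gaussian, zero-mean quantization error (illustrative only).
    This equals the standard normal CDF evaluated at |delta| / sigma_delta.
    """
    return 0.5 * (1 + math.erf(abs(delta) / (sigma_delta * math.sqrt(2))))

# Hypothetical values chosen only for illustration.
threshold, sigma_delta = sign_threshold(beta1=0.9, sigma_m=0.01, sigma_g=0.05)
p_at_threshold = sign_preserved_probability(threshold, sigma_delta)
p_above = sign_preserved_probability(2 * threshold, sigma_delta)

print(f"threshold: {threshold:.5f}")
print(f"P(sign preserved) at threshold: {p_at_threshold:.4f}")
print(f"P(sign preserved) at 2x threshold: {p_above:.4f}")
```

The constant 1.645 is the one-sided 95% quantile of the standard normal distribution, which is why the probability at the threshold is about 0.95 and rises as $|\Delta|$ grows.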

Figures (7)

Figure 1: Comparison in GPU memory usage of different full-parameter fine-tuning methods, including standard FP32 Adam (Kingma & Ba, 2015), mixed-precision FP16 Adam (Micikevicius et al., 2017), BitsandBytes (Dettmers et al., 2021), and the proposed QFT. QFT significantly reduces training memory, enabling fine-tuning with affordable resources. To ensure the performance of quantized fine-tuning, QFT adopts the hybrid feature quantizer for weights, and for gradients and momentum, we theoretically prove that Lion exhibits high robustness to quantization, thereby ensuring comparable convergence to FP32 Adam.
Figure 2: Comparison between our QFT and traditional QAT in the computation and update procedures of weights. QAT stores the weights in floating-point format and adds fake quantization nodes to the computation. Conversely, in our QFT, the weights are stored in low-precision integer format and de-quantized on-the-fly into floating-point format for computation, resulting in a significant reduction in memory usage.
Figure 3: Illustration of the model state distributions when training a LLaMA-2-7B model. The weight values are from the final down projection layer, and the gradient and momentum values are taken at the 200th training step. The gradients and momentum show a canonical centralized distribution with few outliers, while the range of the weights is larger by three orders of magnitude and exhibits extreme outliers, posing a significant challenge.
Figure 4: The proposed stack-based gradient flow scheme, which enables storage and $O(1)$ complexity access to integer gradients. This effectively eliminates AutoGrad's dependency on floating-point formats, enabling efficient gradient propagation in the context of integer weights.
Figure 5: Comparison of training loss curves.
...and 2 more figures
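The stack-based gradient flow described in the Figure 4 caption can be pictured with a small sketch. This is an illustration of the general idea and not the authors' scheme: it assumes that backpropagation visits layers in reverse forward order, pushes each layer's quantized gradient onto a stack, and that the update phase pops them back off, so every access is constant-time (amortized for the push onto a Python list). `GradientStack`, `quantize_int8`, the layer names, and the gradient values are all hypothetical.

```python
from typing import Dict, List, Tuple

QuantGrad = Tuple[str, List[int], float]  # (layer name, INT8 values, scale)

def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8 quantization (illustrative helper)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale

class GradientStack:
    """Holds integer gradients outside the autograd graph; push and pop are O(1)."""

    def __init__(self) -> None:
        self._items: List[QuantGrad] = []

    def push(self, name: str, grad: List[float]) -> None:
        q, scale = quantize_int8(grad)
        self._items.append((name, q, scale))  # amortized O(1)

    def pop(self) -> QuantGrad:
        return self._items.pop()  # O(1), no search by layer name needed

    def __len__(self) -> int:
        return len(self._items)

# Hypothetical layers and gradients, for illustration only.
forward_order = ["embed", "block0", "block1", "head"]
fake_grads: Dict[str, List[float]] = {
    name: [0.1 * (i + 1), -0.05] for i, name in enumerate(forward_order)
}

# Backward pass: layers are visited in reverse forward order.
stack = GradientStack()
for name in reversed(forward_order):
    stack.push(name, fake_grads[name])

# Update phase: popping yields layers in forward order, one O(1) access each.
update_order: List[str] = []
while len(stack) > 0:
    name, q_grad, scale = stack.pop()
    update_order.append(name)

print(update_order)
```

The design point is that a stack matches the fixed visiting order of backpropagation, so integer gradients can be stored and retrieved without a lookup and without relying on floating-point gradient fields.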

Theorems & Definitions (3)

Lemma 1
proof
Remark 1
