QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer

TL;DR

QFT addresses the high memory demands of full-parameter fine-tuning for large language models by quantizing all training states to INT8, enabling end-to-end training on affordable hardware. It combines a theoretically robust Lion-based quantization approach with a hybrid feature quantizer to preserve sparse critical weights, plus a stack-based O(1) gradient flow for integer backpropagation. Empirical results on LLaMA-2 models show memory reductions to roughly 21% of FP32 training, with performance comparable to full-precision fine-tuning, albeit with a modest time overhead. The work demonstrates that full-parameter fine-tuning can be practical on commodity GPUs, potentially broadening access to LLM adaptation in resource-constrained environments.
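The "roughly 21%" figure can be sanity-checked with a back-of-envelope count of bytes per parameter. The sketch below is an idealized estimate, not the paper's accounting. It assumes the FP32 baseline stores weights, gradients, and two Adam moments, and that the quantized setup stores INT8 weights, INT8 gradients, and a single INT8 Lion momentum buffer. The constant and function names are illustrative.

```python
# Back-of-envelope model-state memory, in bytes per parameter.
# Assumption: the FP32 baseline keeps weights, gradients, and two Adam moments.
FP32_ADAM_BYTES = 4 + 4 + 4 + 4
# Assumption: INT8 weights, INT8 gradients, and one INT8 Lion momentum buffer.
INT8_LION_BYTES = 1 + 1 + 1


def model_state_gb(num_params: float, bytes_per_param: float) -> float:
    """Model-state memory in GB (10**9 bytes); activations are ignored."""
    return num_params * bytes_per_param / 1e9


n = 7e9  # roughly a 7B-parameter model
fp32_gb = model_state_gb(n, FP32_ADAM_BYTES)
int8_gb = model_state_gb(n, INT8_LION_BYTES)
ratio = INT8_LION_BYTES / FP32_ADAM_BYTES
print(f"FP32 Adam: {fp32_gb:.0f} GB, INT8 Lion: {int8_gb:.0f} GB, ratio: {ratio:.2%}")
```

This idealized count gives 3/16 = 18.75%. The reported value of about 21% is somewhat higher, presumably because quantization scales and the critical features kept in higher precision add overhead that this estimate leaves out.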

Abstract

Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however, this process typically requires a large number of expensive, high-end GPUs. Although there have been efforts focused on parameter-efficient fine-tuning, they cannot fully unlock the powerful potential of full-parameter fine-tuning. In this paper, we propose QFT, a Quantized Full-parameter Tuning framework for LLMs that quantizes and stores all training states, including weights, gradients, and optimizer states, in INT8 format to reduce training memory, thereby enabling full-parameter fine-tuning on existing GPUs at an affordable cost. To ensure training performance, we make two key efforts: i) for quantized gradients and optimizer states, we theoretically prove that the Lion optimizer, with its property of consistent update magnitudes, is highly robust to quantization; and ii) for quantized weights, we employ the hybrid feature quantizer, which identifies and protects a small subset of sparse critical features while quantizing the remaining dense features, thus ensuring accurate weight updates without FP32 backups. Moreover, to support backpropagation in the integer context, we develop a stack-based gradient flow scheme with O(1) complexity, forming a unified integer training pipeline. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance, e.g., tuning a LLaMA-7B model requires less than 30GB of memory, making it feasible on a single A6000 GPU.
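The idea behind a hybrid feature quantizer can be illustrated with a small NumPy sketch. It is not the paper's implementation. It assumes that "critical features" means the largest-magnitude entries, that they are kept in FP32 with their indices, and that the dense remainder uses a single per-tensor INT8 scale. The function names and the `outlier_frac` parameter are hypothetical.

```python
import numpy as np


def hybrid_quantize(w, outlier_frac=0.01):
    """Split w into a sparse FP32 part (largest magnitudes) and a dense INT8 part."""
    flat = w.astype(np.float32).ravel()
    k = min(flat.size, max(1, int(round(outlier_frac * flat.size))))
    # Indices of the k largest-magnitude entries (assumed "critical" criterion).
    idx = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
    outlier_vals = flat[idx].copy()
    dense = flat.copy()
    dense[idx] = 0.0  # outliers no longer stretch the INT8 range
    max_abs = float(np.abs(dense).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(dense / scale), -127, 127).astype(np.int8)
    return q, scale, idx, outlier_vals, w.shape


def hybrid_dequantize(q, scale, idx, outlier_vals, shape):
    """Rebuild an FP32 tensor from the INT8 dense part plus the FP32 outliers."""
    flat = q.astype(np.float32) * np.float32(scale)
    flat[idx] = outlier_vals
    return flat.reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
w[0, 0], w[3, 7] = 5.0, -4.0  # two synthetic extreme outliers

packed = hybrid_quantize(w, outlier_frac=0.01)
w_hat = hybrid_dequantize(*packed)
hybrid_err = float(np.abs(w - w_hat).max())

# Baseline: plain per-tensor INT8, where the outliers set the scale.
naive_scale = float(np.abs(w).max()) / 127.0
q_naive = np.clip(np.round(w / naive_scale), -127, 127).astype(np.int8)
w_naive = q_naive.astype(np.float32) * np.float32(naive_scale)
naive_err = float(np.abs(w - w_naive).max())
print(f"max abs error: hybrid={hybrid_err:.2e}, naive INT8={naive_err:.2e}")
```

Because the few extreme values are stored separately, the INT8 scale is set by the dense bulk of the distribution, so the rounding error on ordinary weights stays small.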

Paper Structure (15 sections, 1 theorem, 9 equations, 7 figures, 5 tables, 3 algorithms)

This paper contains 15 sections, 1 theorem, 9 equations, 7 figures, 5 tables, 3 algorithms.

Key Result

Lemma 1

Under the paper's assumption, when quantizing gradients and momentum in Lion, if the increment $\Delta$ satisfies $|\Delta|\ge 1.645\sqrt{\beta_1^2\sigma_m^2+(1-\beta_1)^2\sigma_g^2}$, then with 95% probability, $\mathrm{sign}(\Delta)$ remains invariant under quantization.
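The threshold in Lemma 1 can be checked numerically under one plausible reading of the assumption, which is not reproduced on this page. The sketch below assumes that quantization adds independent zero-mean Gaussian errors with standard deviations $\sigma_m$ and $\sigma_g$ to the momentum and the gradient, and that $\Delta$ is Lion's interpolation $\beta_1 m + (1-\beta_1) g$. Under that reading, the perturbation of $\Delta$ has standard deviation $\sqrt{\beta_1^2\sigma_m^2+(1-\beta_1)^2\sigma_g^2}$, and 1.645 is the one-sided 95% quantile of the standard normal. The function name and the parameter values are illustrative.

```python
import numpy as np


def sign_preserved_fraction(delta, beta1, sigma_m, sigma_g, n=200_000, seed=0):
    """Monte Carlo estimate of P(sign of the perturbed increment == sign(delta))."""
    rng = np.random.default_rng(seed)
    eps_m = rng.normal(0.0, sigma_m, size=n)  # assumed momentum quantization error
    eps_g = rng.normal(0.0, sigma_g, size=n)  # assumed gradient quantization error
    delta_q = delta + beta1 * eps_m + (1.0 - beta1) * eps_g
    return float(np.mean(np.sign(delta_q) == np.sign(delta)))


beta1, sigma_m, sigma_g = 0.9, 0.02, 0.05  # illustrative values only
noise_std = np.sqrt(beta1**2 * sigma_m**2 + (1 - beta1) ** 2 * sigma_g**2)
threshold = 1.645 * noise_std

at_threshold = sign_preserved_fraction(threshold, beta1, sigma_m, sigma_g)
above = sign_preserved_fraction(2 * threshold, beta1, sigma_m, sigma_g)
print(f"|delta| at threshold: {at_threshold:.3f}, at twice the threshold: {above:.4f}")
```

At the threshold the estimated fraction should sit close to 0.95, and it rises quickly as $|\Delta|$ grows, which matches the intuition that only increments close to zero are at risk of a sign flip.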

Figures (7)

  • Figure 1: Comparison in GPU memory usage of different full-parameter fine-tuning methods, including standard FP32 Adam (Kingma & Ba, 2015), mixed-precision FP16 Adam (Micikevicius et al., 2017), BitsandBytes (Dettmers et al., 2021), and the proposed QFT. QFT significantly reduces training memory, enabling fine-tuning with affordable resources. To ensure the performance of quantized fine-tuning, QFT adopts the hybrid feature quantizer for weights, and for gradients and momentum, we theoretically prove that Lion exhibits high robustness to quantization, thereby ensuring comparable convergence to FP32 Adam.
  • Figure 2: Comparison between our QFT and traditional QAT in the computation and update procedures of weights. QAT stores the weights in the floating-point format and adds fake quantization nodes to the computation. Conversely, in our QFT, the weights are stored in the low-precision integer format, which are de-quantized on-the-fly into the floating-point format for computation, resulting in a significant reduction in memory usage.
  • Figure 3: Illustration of the model state distributions when training a LLaMA-2-7B model. The weight values are from the final down projection layer, and the gradient and momentum values are fetched on the 200th training step. The gradients and momentum show a canonical centralized distribution with few outliers, while the range of the weights increases by three orders of magnitude and exhibits extreme outliers, posing a significant challenge.
  • Figure 4: The proposed stack-based gradient flow scheme, which enables storage and $O(1)$ complexity access to integer gradients. This effectively eliminates AutoGrad's dependency on floating-point formats, enabling efficient gradient propagation in the context of integer weights.
  • Figure 5: Comparison of training loss curves.
  • ...and 2 more figures
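
The stack idea in the Figure 4 caption can be sketched in a few lines. The captions do not describe how the actual scheme hooks into AutoGrad, so everything below is an illustrative assumption: the class name, the per-tensor INT8 scale, and the order in which gradients are pushed during a mock backward pass and popped for the update. A Python list gives amortized O(1) push and pop.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (assumed scheme, for illustration)."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


class IntGradStack:
    """LIFO store for quantized gradients with amortized O(1) push and pop."""

    def __init__(self):
        self._items = []

    def push(self, name, grad):
        q, scale = quantize_int8(grad)
        self._items.append((name, q, scale))

    def pop(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)


layers = ["layer0", "layer1", "layer2"]
rng = np.random.default_rng(0)
stack = IntGradStack()

# Mock backward pass: gradients become available from the last layer to the first.
for name in reversed(layers):
    grad = rng.normal(0.0, 1e-3, size=(4, 4)).astype(np.float32)
    stack.push(name, grad)

# Update phase: popping returns the layers in forward order.
popped, dtypes = [], []
while len(stack):
    name, q, scale = stack.pop()
    popped.append(name)
    dtypes.append(q.dtype)
    grad_fp = q.astype(np.float32) * np.float32(scale)  # de-quantize on the fly

print(popped)
```

Each gradient is held only as an INT8 tensor plus one scale between the backward pass and the update, so no floating-point gradient buffer has to persist for the whole model.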

Theorems & Definitions (3)

  • Lemma 1
  • Proof
  • Remark 1