
QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models

Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang

TL;DR

This work tackles the challenge of fine-tuning large language models on low-precision hardware without backpropagation. It introduces QuZO, a Quantized Zeroth-Order optimizer that uses a novel two-perturbation, stochastic-quantization gradient estimator to perform forward-only updates with $4$- or $8$-bit precision, avoiding the straight-through estimator. The method demonstrates superior or competitive accuracy compared to first-order quantized training across RoBERTa-Large, OPT, and LLaMA-2 models, while achieving substantial memory savings (a $2.94 \times$ reduction in LLaMA2-7B fine-tuning compared to quantized first-order methods) and enabling LoRA-based, parameter-efficient fine-tuning. The analysis includes gradient-quality assessments, memory-efficiency comparisons, and hybrid datatype strategies to further enhance performance, indicating practical viability for on-device or resource-constrained fine-tuning of very large models.

Abstract

Large language models (LLMs) are often quantized to lower precision to reduce the memory cost and latency of inference. However, quantization often degrades model performance, so fine-tuning is required for various downstream tasks. Traditional fine-tuning methods such as stochastic gradient descent and Adam require backpropagation, which is error-prone in low-precision settings. To overcome these limitations, we propose the Quantized Zeroth-Order (QuZO) framework, specifically designed for fine-tuning LLMs through low-precision (e.g., 4- or 8-bit) forward passes. Our method avoids the error-prone low-precision straight-through estimator and uses optimized stochastic rounding to mitigate the increased bias. QuZO simplifies the training process while achieving results comparable to first-order methods in ${\rm FP}8$ and superior accuracy in ${\rm INT}8$ and ${\rm INT}4$ training. Experiments demonstrate that low-bit QuZO training achieves performance comparable to MeZO optimization on GLUE, Multi-Choice, and Generation tasks, while reducing memory cost by $2.94 \times$ in LLaMA2-7B fine-tuning compared to quantized first-order methods.
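
To make the idea of forward-only, stochastically rounded updates concrete, the following is a minimal sketch on a toy quadratic loss, not the authors' implementation. It assumes a symmetric per-tensor quantizer and a two-point zeroth-order estimator in which the same random direction is quantized twice, independently: one copy perturbs the parameters in the two forward passes, and the other gives the update direction. The function names, the toy loss, the step sizes, and the choice to keep the parameters themselves in floating point are all illustrative assumptions.

```python
import math
import random

TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum of the toy loss


def stochastic_round_int(y):
    """Round y to a neighbouring integer at random so that E[result] == y."""
    low = math.floor(y)
    # Round up with probability equal to the fractional part of y.
    return low + (1 if random.random() < (y - low) else 0)


def quantize(vec, bits=8):
    """Symmetric per-tensor stochastic quantization (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in vec) or 1.0  # avoid a zero scale
    scale = amax / qmax
    out = []
    for v in vec:
        q = stochastic_round_int(v / scale)
        q = max(-qmax, min(qmax, q))  # clamp to the signed integer range
        out.append(q * scale)         # dequantize for this toy example
    return out


def loss(theta):
    """Toy quadratic loss standing in for a forward pass of the model."""
    return sum((t - s) ** 2 for t, s in zip(theta, TARGET))


def zo_gradient(theta, eps=1e-2, bits=8):
    """Two-point zeroth-order estimate with two independent quantizations."""
    u = [random.gauss(0.0, 1.0) for _ in theta]
    u1 = quantize(u, bits)  # perturbation used in the two forward passes
    u2 = quantize(u, bits)  # independently rounded copy used as direction
    plus = loss([t + eps * a for t, a in zip(theta, u1)])
    minus = loss([t - eps * a for t, a in zip(theta, u1)])
    coeff = (plus - minus) / (2.0 * eps)  # scalar directional derivative
    return [coeff * b for b in u2]


def finetune(steps=2000, lr=1e-2, seed=0):
    """Run forward-only updates; no backpropagation is involved."""
    random.seed(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = zo_gradient(theta)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta
```

Calling `finetune()` drives the parameters toward `TARGET` using only loss evaluations. Because each rounding is unbiased and the two roundings are independent given the direction, the rounding errors do not correlate in the product of the two quantized copies, which is the intuition behind using two perturbations in this sketch.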

Paper Structure

This paper contains 35 sections, 19 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: The proposed QuZO provides higher fine-tuning accuracy than first-order (FO) methods in ultra-low precision on the RoBERTa-Large model.
  • Figure 2: Computational graphs for quantized first-order (FO) and zeroth-order (ZO) training.
  • Figure 3: (a) Errors of quantized gradient estimation Q-RGE1 in Eq. \ref{equation:mezo_quant} and our proposed Q-RGE2 in Eq. \ref{equation:qzo}. (b) Training loss of low-precision ZO optimizer with these two quantized gradient estimators, respectively.
  • Figure 4: Experimental findings on RoBERTa-large (350M parameters) with prompts reveal that QuZO, leveraging full-parameter tuning, starts to surpass FO and LLM-QAT as precision reduces to ${\rm INT}8$ or below.
  • Figure 5: Peak memory usage of FP16 and INT8 training on the OPT 1.3B/2.7B model with sequence lengths of 512 (left) and 1024 (right).
  • ...and 1 more figures