Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, Zhihao Jia

TL;DR

Quantized Side Tuning (QST) targets the three main memory bottlenecks in finetuning large language models: model weights, optimizer states, and intermediate activations. It combines 4-bit weight quantization with a side network that bypasses backpropagation through the LLM, using downsample modules to drastically reduce trainable parameters and optimizer memory while preserving accuracy. Empirical results across GLUE, MMLU, and chatbot benchmarks show up to about 2.3× total memory reduction and up to 3× faster finetuning, with even larger gains for full finetuning (up to ~7× memory reduction). The approach scales to 1.3B–70B models (OPT and LLaMA-2) and maintains competitive or superior performance in many tasks, suggesting practical applicability for memory-constrained finetuning of very large models.
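
As a rough illustration of why weights and optimizer states dominate, the following back-of-envelope sketch estimates two of the three memory sources for a 70B-parameter model. It is not taken from the paper: the function names, the 350M trainable-parameter count for the side network, and the choice of Adam with 32-bit states are all assumptions made for illustration, and the estimate ignores quantization constants and intermediate activations.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed to store the model weights, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def adam_state_memory_gb(num_trainable: float, bytes_per_state: int = 4) -> float:
    """Adam keeps two moment estimates per trainable parameter (assumed fp32)."""
    return num_trainable * 2 * bytes_per_state / 1e9


params_70b = 70e9

# Weights: 16-bit storage versus 4-bit quantized storage.
fp16_weights = weight_memory_gb(params_70b, 16)
int4_weights = weight_memory_gb(params_70b, 4)

# Optimizer states: updating every weight versus a small side network.
# The 350M trainable-parameter count is a hypothetical value, not a paper result.
side_trainable = 350_000_000
full_ft_states = adam_state_memory_gb(params_70b)
side_states = adam_state_memory_gb(side_trainable)

print(f"weights   : {fp16_weights:.1f} GB (16-bit) vs {int4_weights:.1f} GB (4-bit)")
print(f"Adam state: {full_ft_states:.1f} GB (full) vs {side_states:.1f} GB (side network)")
```

Under these assumptions, 4-bit storage cuts weight memory by a factor of four, and optimizer memory scales directly with the number of trainable parameters, which is why shrinking the trainable side network matters.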

Abstract

Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which updates only a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of finetuning. Typically, the memory footprint during finetuning stems from three contributors: model weights, optimizer states, and intermediate activations. However, existing works still require considerable memory, and none can simultaneously reduce the memory footprint of all three sources. In this paper, we present Quantized Side Tuning (QST), which enables memory-efficient and fast finetuning of LLMs through a dual-stage process. First, QST quantizes an LLM's model weights to 4 bits to reduce the memory footprint of the LLM's original weights. Second, QST introduces a side network separated from the LLM, which uses the hidden states of the LLM to make task-specific predictions. Using a separate side network avoids backpropagation through the LLM, thus reducing the memory required for intermediate activations. Furthermore, QST leverages several low-rank adaptors and gradient-free downsample modules to significantly reduce the number of trainable parameters, and with it the memory footprint of the optimizer states. Experiments show that QST can reduce the total memory footprint by up to $2.3\times$ and speed up finetuning by up to $3\times$ while achieving competitive performance compared with the state of the art. Compared with full finetuning, QST can reduce the total memory footprint by up to $7\times$.
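
The data flow described in the abstract can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: average pooling stands in for a gradient-free downsample module, the scalar `gate` mixing rule is an assumption, and all function names and the sizes `d = 4096`, `r = 16` are hypothetical values chosen for illustration. The backbone hidden state is treated as a fixed input, mirroring the fact that no gradients flow through the quantized LLM.

```python
from typing import List


def avg_pool_downsample(hidden: List[float], r: int) -> List[float]:
    """Parameter-free downsample: average each group of r consecutive features."""
    if len(hidden) % r != 0:
        raise ValueError("hidden size must be divisible by the reduction factor r")
    return [sum(hidden[i:i + r]) / r for i in range(0, len(hidden), r)]


def side_layer_input(side_state: List[float], backbone_hidden: List[float],
                     r: int, gate: float) -> List[float]:
    """Mix the downsampled (frozen) backbone hidden state into the side state."""
    down = avg_pool_downsample(backbone_hidden, r)
    return [gate * d + (1.0 - gate) * s for d, s in zip(down, side_state)]


def linear_downsample_params(d: int, r: int) -> int:
    """Trainable parameters of a linear map d -> d // r (weights plus bias)."""
    return d * (d // r) + d // r


# Toy example: an 8-dimensional backbone state reduced by r = 4.
backbone_hidden = [1.0, 3.0, 5.0, 7.0, 2.0, 2.0, 4.0, 8.0]
side_state = [0.0, 2.0]
mixed = side_layer_input(side_state, backbone_hidden, r=4, gate=0.5)
print(mixed)

# A learned linear downsample would add trainable parameters at every layer,
# whereas pooling adds none.
print(linear_downsample_params(4096, 16))
```

The comparison at the end shows the motivation for a gradient-free downsample: a learned linear projection contributes trainable parameters (and matching optimizer states) at every layer, while pooling contributes none.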

Paper Structure (29 sections, 2 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: The first panel shows the memory footprint of different methods when finetuning LLaMA-2-70B. The second panel shows the MMLU 5-shot accuracy of different methods when tuning LLaMA-2-7B, LLaMA-2-13B, and LLaMA-2-70B. The batch size is set to 16 and the sequence length to 384. Larger markers represent larger models.
  • Figure 2: An overview of quantized side tuning.
  • Figure 3: Illustration of the $i^{th}$ layer of QST.
  • Figure 4: Effects of the batch size, total model bits, and sequence length on memory footprint.
  • Figure 5: Effects of the reduction factor $r$ on MMLU accuracy, memory footprint, and training throughput.
  • ...and 1 more figure