ApiQ: Finetuning of 2-Bit Quantized Large Language Model

Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz

TL;DR

ApiQ is a quantization framework that restores the information lost to quantization by initializing the LoRA components and quantizing the LLM weights concurrently. This preserves the activation precision of the original LLM and mitigates error propagation from shallower into deeper layers.

Abstract

Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primarily due to the constraints posed by GPU memory limitations and the effectiveness of these methods compared to full finetuning. Despite the advancements, current strategies for memory-efficient finetuning, such as QLoRA, exhibit inconsistent performance across diverse bit-width quantizations and multifaceted tasks. This inconsistency largely stems from the detrimental impact of the quantization process on preserved knowledge, leading to catastrophic forgetting and undermining the utilization of pretrained models for finetuning purposes. In this work, we introduce a novel quantization framework, ApiQ, designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs. This approach ensures the maintenance of the original LLM's activation precision while mitigating the error propagation from shallower into deeper layers. Through comprehensive evaluations conducted on a spectrum of language tasks with various LLMs, ApiQ demonstrably minimizes activation error during quantization. Consequently, it consistently achieves superior finetuning results across various bit-widths.

Paper Structure (34 sections, 7 equations, 11 figures, 15 tables, 1 algorithm)

This paper contains 34 sections, 7 equations, 11 figures, 15 tables, 1 algorithm.

Figures (11)

  • Figure 1: Finetuning performance over various tasks. 1st row: LLM is DeBERTa-v3-base for GLUE and Llama-2-7B for the rest. 2nd row: LLM is RoBERTa-large for GLUE and Llama-2-13B for the rest. For better visualization, some extremely poor results from (2-bit or 3-bit) QLoRA are omitted.
  • Figure 2: Memory allocation (GB) of an A100-80GB GPU for finetuning Llama-2-7B. The optimizer is Adam. The batch size is 1. The sequence length is 2048. For QLoRA, the bit-width is 4 and the LoRA rank is 64.
  • Figure 3: Relative weight quantization error of 2-bit quantized Llama-2-7B, i.e. $e = \lVert\delta W^{\text{baseline}}\rVert_F - \lVert\delta W^{\text{method}}\rVert_F$. The larger $e$ is, the more effective the method is in reducing weight error compared to the baseline. Left: The method is LoftQ and the baseline is QLoRA. Middle: The method is ApiQ and the baseline is QLoRA. Right: The method is ApiQ and the baseline is LoftQ. Refer to Figure \ref{fig: weight diff} for the 2-bit and 4-bit non-relative weight error.
  • Figure 4: The average activation error $\lVert\bm{X}\bm{W} - \bm{X}^q(\bm{Q} + \bm{AB}^\top)\rVert_F$ per token for different linear layers of Llama-2-7B. 1st column: The activation error for every transformer block. We randomly sample 128 sentences from C4 to obtain the activations. For better visualization, some lines are divided by a factor, denoted as "/ factor". Please pay attention to the scale of the y-axis to compare different methods. ApiQ has the smallest activation error.
  • Figure 5: Histogram of $\bm{Q}$, $\bm{A}$ and $\bm{B}$ for the 2-bit quantized output projection layer in the 30$^\mathrm{th}$ block of Llama-2-7B. Left: LoftQ. Right: ApiQ-lw. Refer to Figures \ref{fig: hist 4bit}, \ref{fig: hist 3bit}, \ref{fig: hist 2bit} and \ref{fig: hist loftq 2bit} for all layers.
  • ...and 6 more figures
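
The two error measures named in the captions of Figures 3 and 4 can be illustrated with a small NumPy sketch. It is not the paper's implementation. The toy round-to-nearest quantizer, the matrix shapes, the choice $\bm{X}^q = \bm{X}$, the per-token averaging (Frobenius norm divided by the number of tokens), and all function names are assumptions made for illustration. The "method" here is a plain rank-r SVD of the quantization residual, loosely in the spirit of a single LoftQ-style step, and stands in for neither LoftQ nor ApiQ themselves.

```python
import numpy as np

def uniform_quantize(w, bits=2):
    """Toy per-tensor round-to-nearest quantizer (assumption, not the paper's)."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    return np.round((w - w_min) / scale) * scale + w_min

def weight_error(W, Q, A, B):
    """||delta W||_F with delta W = W - (Q + A B^T)."""
    return np.linalg.norm(W - (Q + A @ B.T), ord="fro")

def activation_error_per_token(X, W, Xq, Q, A, B):
    """||X W - X^q (Q + A B^T)||_F divided by the number of tokens (rows of X)."""
    diff = X @ W - Xq @ (Q + A @ B.T)
    return np.linalg.norm(diff, ord="fro") / X.shape[0]

def svd_residual_init(W, Q, rank):
    """Rank-r factors A, B such that A B^T best approximates W - Q (hypothetical helper)."""
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (d_in, rank)
    B = Vt[:rank].T              # shape (d_out, rank)
    return A, B

rng = np.random.default_rng(0)
d_in, d_out, n_tokens, rank = 64, 48, 128, 8
W = rng.normal(size=(d_in, d_out))     # stand-in for a full-precision weight
X = rng.normal(size=(n_tokens, d_in))  # stand-in for input activations
Q = uniform_quantize(W, bits=2)

# Baseline: A = 0, so A B^T contributes nothing at initialization.
A0 = np.zeros((d_in, rank))
B0 = rng.normal(size=(d_out, rank))

# Stand-in "method": SVD of the quantization residual.
A1, B1 = svd_residual_init(W, Q, rank)

# Relative weight error e as defined in the Figure 3 caption.
e = weight_error(W, Q, A0, B0) - weight_error(W, Q, A1, B1)

# Activation error as in the Figure 4 caption, with X^q = X (single layer).
act_base = activation_error_per_token(X, W, X, Q, A0, B0)
act_svd = activation_error_per_token(X, W, X, Q, A1, B1)

print(f"relative weight error e: {e:.4f}")
print(f"activation error (baseline): {act_base:.4f}")
print(f"activation error (SVD init): {act_svd:.4f}")
```

In this sketch $e$ is positive because a truncated SVD of the residual can only shrink its Frobenius norm. A smaller weight error does not by itself guarantee a smaller activation error, which is why the two quantities are computed separately.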