Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study

Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, Ji-Rong Wen

TL;DR

The paper investigates how post-training quantization affects three emergent abilities of large language models: in-context learning (ICL), chain-of-thought reasoning (CoT), and instruction following (IF). Using GPTQ-based 4-bit and 2-bit quantization across LLaMA backbones, it finds that 4-bit weights largely preserve emergent abilities while 2-bit quantization severely degrades performance, particularly for CoT and IF. It identifies FFN components and outlier activations as key sources of quantization sensitivity and demonstrates that fine-grained substructure quantization and post-quantization fine-tuning (including LoRA_q) can substantially recover performance for very low-bit models. The findings provide practical guidance for memory-efficient deployment of LLMs and highlight avenues for achieving extremely low-bit quantization without sacrificing core emergent capabilities.

Abstract

Despite their superior performance, Large Language Models (LLMs) require significant computational resources for deployment and use. To overcome this issue, quantization methods have been widely applied to reduce the memory footprint of LLMs and to increase the inference rate. However, a major challenge is that low-bit quantization methods often lead to performance degradation. It is important to understand how quantization impacts the capacity of LLMs. Unlike previous studies that focused on overall performance, this work investigates the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models. Specifically, we examine the abilities of in-context learning, chain-of-thought reasoning, and instruction following in quantized LLMs. Our empirical experiments show that these emergent abilities still exist in 4-bit quantized models, while 2-bit models suffer severe performance degradation on tests of these abilities. To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning. Our work derives a series of important findings for understanding the impact of quantization on emergent abilities and sheds light on the possibilities of extremely low-bit quantization for LLMs.
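To make the 4-bit versus 2-bit contrast concrete, the following is a minimal sketch of uniform round-to-nearest quantization applied to a synthetic weight vector. It is an illustration only: the paper uses GPTQ, which is a more elaborate method, and the helper names `quantize_rtn` and `mean_squared_error`, the Gaussian weights, and the per-tensor min/max grid are all assumptions made for this example.

```python
import random


def quantize_rtn(weights, bits):
    """Quantize to a uniform grid with 2**bits levels, then dequantize.

    Hypothetical helper: plain asymmetric round-to-nearest over the
    tensor's min/max range. This is NOT GPTQ.
    """
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Integer codes in [0, levels], one per weight.
    codes = [round((w - lo) / scale) for w in weights]
    # Map the codes back to floating-point values.
    return [lo + c * scale for c in codes]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic stand-in for one weight tensor (assumption, not real LLaMA weights).
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

errors = {
    bits: mean_squared_error(weights, quantize_rtn(weights, bits))
    for bits in (8, 4, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

With only four representable values at 2 bits, the reconstruction error grows sharply compared with 4 bits. That is consistent in spirit with the degradation the abstract reports, although this toy measure says nothing about task accuracy.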

Paper Structure (39 sections, 4 figures, 8 tables)


Figures (4)

  • Figure 1: Performance comparison of quantized models under varied memory costs. For AutoEval, the term "Relative Score" denotes the score ratio between quantized models and GPT3.5. The $x$-axis denotes the total number of bits after quantization.
  • Figure 2: Impacts of different model components or substructures on MMLU (five-shot). The memory footprint is counted in GiB (in green dotted lines).
  • Figure 3: Impacts of feature outliers on LLaMA models (7B and 13B). "non-outlier" denotes the quantization on all non-outlier dimensions, and "+top-1" and "+top-3" refer to quantization of the top-1 and top-3 outlier dimensions in addition to the non-outlier dimensions. "$\downarrow$" indicates that lower indicators are better.
  • Figure 4: Impacts of different model components or substructures on MMLU (5-shot), GSM8K and WikiText. The memory footprint is counted in GiB (in green dotted lines).
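The memory quantities in the captions above (total bits in Figure 1, GiB in Figures 2 and 4) can be related to bit width with simple arithmetic. The sketch below is a rough estimate under stated assumptions: it uses nominal parameter counts of 7e9 and 13e9, treats every weight as quantized to the same width, and ignores quantization metadata such as scales and zero points. The function name `weight_memory_gib` is hypothetical, and the output is not expected to match the paper's reported footprints exactly.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB for a uniformly quantized model.

    Assumes all parameters share one bit width and ignores per-group
    scales, zero points, and any layers kept in higher precision.
    """
    total_bits = num_params * bits_per_weight
    total_bytes = total_bits / 8
    return total_bytes / 2 ** 30


# Nominal sizes (assumption); real checkpoints differ slightly.
nominal_sizes = {"7B": 7e9, "13B": 13e9}

for name, num_params in nominal_sizes.items():
    for bits in (16, 4, 2):
        gib = weight_memory_gib(num_params, bits)
        print(f"{name} at {bits}-bit: about {gib:.2f} GiB")
```

Under these assumptions the footprint scales linearly with bit width, so moving from 16-bit to 4-bit weights cuts storage by a factor of four, and 2-bit by a factor of eight.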