The Uniqueness of LLaMA3-70B Series with Per-Channel Quantization

Minghai Qin

TL;DR

The weight distribution of LLaMA3-70B is the primary factor behind the vulnerability of the LLaMA3-70B model series under W8A8 quantization. The paper proposes a mixed strategy in which fewer than 3% of the layers employ finer per-group W8A8 quantization granularity.

Abstract

We have observed a distinctive quantization-related behavior in the LLaMA3/3.1-70B models that is absent in both the LLaMA2-70B and LLaMA3/3.1/3.2-1B/3B/8B/405B models. Quantization is a crucial technique for deploying large language models (LLMs) efficiently. The impact of W8A8 post-training quantization on model accuracy, especially on the recently released LLaMA3/3.1 model series, remains contentious. In this paper, we explore three key questions: What makes the LLaMA3-70B model series uniquely vulnerable to quantization? Why is this the case? And how can the issue be addressed? We empirically investigate multiple LLMs featured on an open LLM leaderboard and find that the LLaMA3-70B model series exhibits a unique accuracy degradation under W8A8 per-channel post-training quantization. In contrast, other model series such as LLaMA2, LLaMA3/3.1-8B, LLaMA3.2, Qwen, Mixtral, Mistral, Phi-3, and Falcon demonstrate robust performance with W8A8. Contrary to previous assertions attributing degradation to the large dynamic range of activations, our findings indicate that the weight distribution of LLaMA3-70B is the primary factor behind the vulnerability. By analyzing the distinct characteristics of weight distributions across Transformer blocks, we propose two solutions that make different tradeoffs in hardware/software overhead. First, we propose a mixed strategy in which fewer than 3% of the layers employ finer per-group W8A8 quantization granularity. Second, we introduce a bi-smoothing strategy that balances quantization errors between weights and activations while maintaining per-channel quantization throughout. Experimental results demonstrate that both strategies effectively preserve the accuracy of the entire LLaMA3-70B model series under W8A8 quantization, achieving performance on par with their FP16 counterparts.
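As a rough illustration of the granularity tradeoff the abstract describes, the sketch below compares symmetric 8-bit round-to-nearest quantization of a single weight row using one scale for the whole row (per-channel) against one scale per group of consecutive weights (per-group). It is not the paper's implementation: the helper names, the group size of 128, and the synthetic row with one large outlier are all assumptions chosen for illustration.

```python
import random


def quantize_dequantize(values, num_bits=8):
    """Symmetric round-to-nearest quantization with a single shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)  # nothing to quantize
    scale = max_abs / qmax
    # Quantize to an integer in [-qmax, qmax], then map back to a float.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_channel(row):
    """One scale for the entire row (one output channel)."""
    return quantize_dequantize(row)


def per_group(row, group_size):
    """One scale per group of `group_size` consecutive weights."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_dequantize(row[start:start + group_size]))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weight row: small Gaussian weights plus one large outlier.
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(1024)]
row[5] = 4.0  # assumed outlier, not taken from the paper

err_channel = mean_squared_error(row, per_channel(row))
err_group = mean_squared_error(row, per_group(row, 128))
print("per-channel MSE:", err_channel)
print("per-group MSE:  ", err_group)
```

With a single scale, the outlier stretches the quantization step for every weight in the row. With per-group scales, only the group containing the outlier pays that cost, at the price of storing and applying more scales.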

Paper Structure (35 sections, 1 theorem, 12 equations, 41 figures, 1 table)

Key Result

Theorem 5.1

Let $S\in \mathbb{R}^{N}$ be a non-zero vector. We denote by $W_s$ and $A_s$ the column-wise (last dimension) smoothed tensors, respectively, such that [...] and [...]. Then the following equation holds: [...]
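The equations of the theorem did not survive in this listing, so the following is only a sketch of the general smoothing idea, not the theorem itself. It assumes a SmoothQuant-style convention in which each activation column is divided by the matching entry of $S$ and each weight column is multiplied by it, and it additionally assumes every entry of $S$ is non-zero. Under those assumptions the product $A W^{\top}$ is unchanged by smoothing. The function names and the small matrices are hypothetical.

```python
def matmul_transposed(a, w):
    """Return A @ W^T for A of shape (T, N) and W of shape (M, N)."""
    return [[sum(x * y for x, y in zip(a_row, w_row)) for w_row in w]
            for a_row in a]


def smooth(a, w, s):
    """Assumed convention: divide column j of A by s[j], multiply column j of W by s[j]."""
    a_s = [[x / s_j for x, s_j in zip(row, s)] for row in a]
    w_s = [[y * s_j for y, s_j in zip(row, s)] for row in w]
    return a_s, w_s


# Hypothetical activations with a large-magnitude second column.
a = [[1.0, 64.0, -2.0],
     [0.5, -32.0, 4.0]]
w = [[0.25, 0.5, -1.0],
     [2.0, -0.125, 0.5]]
s = [1.0, 8.0, 0.5]  # every entry must be non-zero

a_s, w_s = smooth(a, w, s)
reference = matmul_transposed(a, w)
smoothed = matmul_transposed(a_s, w_s)
print(reference)
print(smoothed)
```

The scaling moves dynamic range between activations and weights (here the second activation column shrinks by 8x while the matching weight column grows by 8x) without changing the layer output, which is what makes it possible to trade quantization error between the two tensors.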

Figures (41)

  • Figure 1: Accuracy of LLaMA3-70B model series with W16A16, and W8A16 per-channel quantization
  • Figure 2: abs_weights of the "V" matrix in the first Transformer block of LLaMA3-70B
  • Figure 3: W8A8 per-channel quantization accuracy of models using LLaMA3-70B as the base model
  • Figure 4: Accuracy of other LLaMA models with W8A8 per-channel quantization
  • Figure 5: LLaMA3.2 models with W8A8 per-channel quantization
  • ...and 36 more figures

Theorems & Definitions (2)

  • Theorem 5.1
  • proof