The Uniqueness of LLaMA3-70B Series with Per-Channel Quantization

Minghai Qin

TL;DR

Results indicate that the weight distribution of LLaMA3-70B is the primary factor behind the vulnerability of the LLaMA3-70B model series under W8A8 quantization. The paper proposes a mixed strategy in which fewer than 3% of the layers use finer per-group W8A8 quantization granularity.

Abstract

We have observed a distinctive quantization-related behavior in the LLaMA3/3.1-70B models that is absent in both the LLaMA2-70B and LLaMA3/3.1/3.2-1B/3B/8B/405B models. Quantization is a crucial technique for deploying large language models (LLMs) efficiently. The impact of W8A8 post-training quantization on model accuracy, especially on the recently released LLaMA3/3.1 model series, remains contentious. In this paper, we explore three key questions: What makes the LLaMA3-70B model series uniquely vulnerable to quantization? Why is this the case? And how can the issue be addressed? We empirically investigate multiple LLMs featured on an open LLM leaderboard and find that the LLaMA3-70B model series exhibits a unique accuracy degradation under W8A8 per-channel post-training quantization. In contrast, other model series such as LLaMA2, LLaMA3/3.1-8B, LLaMA3.2, Qwen, Mixtral, Mistral, Phi-3, and Falcon remain robust under W8A8. Contrary to previous assertions attributing the degradation to the large dynamic range of activations, our findings indicate that the weight distribution of LLaMA3-70B is the primary factor behind the vulnerability. By analyzing the distinct characteristics of the weight distributions across Transformer blocks, we propose two solutions that make different tradeoffs in hardware/software overhead. First, we propose a mixed strategy in which fewer than 3% of the layers use a finer per-group W8A8 quantization granularity. Second, we introduce a bi-smoothing strategy that balances quantization errors between weights and activations while maintaining per-channel quantization throughout. Experimental results demonstrate that both strategies effectively preserve the accuracy of the entire LLaMA3-70B model series under W8A8 quantization, achieving performance on par with the FP16 counterparts.
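To make the difference between per-channel and per-group granularity concrete, here is a minimal NumPy sketch of symmetric 8-bit fake-quantization. It is not the paper's implementation: the helper name `quantize_dequantize`, the synthetic weight matrix, the injected outlier columns, and the group size of 128 are all assumptions chosen for illustration. Rows are treated as output channels, so per-channel quantization uses one scale per row, while per-group quantization uses one scale per contiguous block of elements within a row.

```python
import numpy as np


def quantize_dequantize(w, group_size=None, n_bits=8):
    """Symmetric fake-quantization along the last axis (hypothetical helper).

    group_size=None -> one scale per row (per-channel).
    group_size=g    -> one scale per contiguous block of g elements in each row.
    """
    qmax = 2 ** (n_bits - 1) - 1  # 127 for 8 bits
    rows, cols = w.shape
    g = cols if group_size is None else group_size
    assert cols % g == 0, "group size must divide the row length"
    blocks = w.reshape(rows, cols // g, g)
    # One scale per block, derived from the block's largest magnitude.
    scale = np.abs(blocks).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(rows, cols)


# Synthetic weights (assumption): small values plus a few large-magnitude columns.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 1024))
w[:, :4] *= 50.0

err_channel = np.mean((w - quantize_dequantize(w)) ** 2)
err_group = np.mean((w - quantize_dequantize(w, group_size=128)) ** 2)
print(f"per-channel MSE: {err_channel:.3e}")
print(f"per-group   MSE: {err_group:.3e}")
```

In this toy setting the outliers inflate the single per-row scale, so every small weight in the row is rounded coarsely. With groups, only the block containing the outliers pays that cost, which is why a finer granularity on a small number of sensitive layers can be attractive despite its extra overhead.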

Paper Structure (35 sections, 1 theorem, 12 equations, 41 figures, 1 table)

Introduction
Quantization Basics
The Uniqueness of LLaMA3-70B with per-channel W8A8
Test Dataset
Tested Models
Test Environments
LLaMA3-70B's sensitivity to W8A8
LLaMA3-70B is the ONLY ONE sensitive to W8A8
Why LLaMA3-70B is the only one?
A qualitative exploration
A quantitative exploration
A Mixed Grouping Strategy
Main experimental results on mixed grouping
Ablation on the group size
A Bi-smoothing Strategy
...and 20 more sections

Key Result

Theorem 5.1

Let $S\in \mathbb{R}^{N}$ be a non-zero vector. Denote by $W_s$ and $A_s$ the column-wise (last-dimension) smoothed tensors, respectively, such that … and …. Then the following equation holds: …
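The defining equations of Theorem 5.1 are not shown here, so the following is only an illustrative sketch of a common smoothing construction, not the theorem's statement. It assumes a SmoothQuant-style scheme in which each column of the activations is divided by the matching entry of a scaling vector and each column of the weights is multiplied by it. The names `a`, `w`, `s`, and the shapes are hypothetical, and every entry of `s` is assumed non-zero so that the division is defined.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))          # activations, shape (tokens, N)
w = rng.normal(size=(32, 16))         # weights, shape (out_features, N)
s = rng.uniform(0.5, 2.0, size=16)    # scaling vector, all entries non-zero (assumption)

a_s = a / s                           # scale each activation column by 1/s
w_s = w * s                           # scale each weight column by s

y_ref = a @ w.T                       # original linear-layer output
y_smooth = a_s @ w_s.T                # output computed from the smoothed tensors

max_diff = np.max(np.abs(y_ref - y_smooth))
print(f"max abs difference: {max_diff:.2e}")
```

Under these assumptions the two outputs agree up to floating-point rounding, because the per-column factors cancel inside the inner product. The scaling therefore only redistributes dynamic range between activations and weights before quantization is applied.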

Figures (41)

Figure 1: Accuracy of LLaMA3-70B model series with W16A16, and W8A16 per-channel quantization
Figure 2: abs_weights of the "V" matrix in the first Transformer block of LLaMA3-70B
Figure 3: W8A8 per-channel quantization accuracy of models using LLaMA3-70B as the base model
Figure 4: Accuracy of other LLaMA models with W8A8 per-channel quantization
Figure 5: LLaMA3.2 models with W8A8 per-channel quantization
...and 36 more figures

Theorems & Definitions (2)

Theorem 5.1
proof
