Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

Cheng Zhang; Jianyi Cheng; Ilia Shumailov; George A. Constantinides; Yiren Zhao

TL;DR

This work explores the statistical and learning properties of LLM layers, attributes the bottleneck of LLM quantisation to numerical scaling offsets, and adapts block quantisation, a family of methods that share scaling factors across packed numbers, to LLMs.

Abstract

Inference with large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit formats. In this work, we explore the statistical and learning properties of LLM layers and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19\times$ higher arithmetic density and a $5\times$ higher memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by $2.5\times$ in arithmetic density and $1.2\times$ in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced.
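
To make the idea of sharing a scaling factor across packed numbers concrete, the following is a minimal sketch of block floating-point style quantisation. It is not the authors' implementation: the function names `bfp_quantise` and `memory_density`, the rounding and clamping choices, and the example bit widths and block size are assumptions chosen only for illustration.

```python
import math


def bfp_quantise(block, mantissa_bits=5):
    """Quantise one block to sign + `mantissa_bits` magnitude bits per element,
    with a single exponent shared by the whole block (illustrative sketch).
    Returns the dequantised values so the rounding error can be inspected."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    # frexp gives max_abs = m * 2**shared_exp with 0.5 <= m < 1,
    # so every element of the block has magnitude below 2**shared_exp.
    _, shared_exp = math.frexp(max_abs)
    step = 2.0 ** (shared_exp - mantissa_bits)  # value of one mantissa unit
    q_max = 2 ** mantissa_bits - 1              # largest representable magnitude
    out = []
    for x in block:
        q = round(x / step)
        q = max(-q_max, min(q_max, q))          # clamp to the mantissa range
        out.append(q * step)
    return out


def memory_density(mantissa_bits, exponent_bits, block_size, baseline_bits=32):
    """Ratio of baseline bits per element to the average bits per element,
    where the shared exponent cost is amortised over the block."""
    avg_bits = 1 + mantissa_bits + exponent_bits / block_size
    return baseline_bits / avg_bits


if __name__ == "__main__":
    block = [0.5, -1.0, 0.03, 1.7]
    print(bfp_quantise(block, mantissa_bits=5))
    # Hypothetical configuration: 1 sign bit, 5 mantissa bits,
    # an 8-bit exponent shared by 16 elements.
    print(memory_density(mantissa_bits=5, exponent_bits=8, block_size=16))
```

Because the exponent is stored once per block, its cost is spread over all elements of the block, which is why the average bit width stays close to the per-element width. Small values that share a block with a large value lose precision, since the step size is set by the largest magnitude in the block.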

Paper Structure (42 sections, 4 equations, 10 figures, 8 tables, 2 algorithms)

This paper contains 42 sections, 4 equations, 10 figures, 8 tables, 2 algorithms.

Introduction
Related Work
Block-based Quantisation
LLM Quantisation
Method
Block-based Arithmetic
Standard floating-point
MiniFloat and Denormalised MiniFloat
Block MiniFloat, Block Floating-Point and Block Logarithm
Arithmetic and Memory Densities
Quantisation Search
Evaluation
Experiment setup
Baselines
Quantisation configuration
...and 27 more sections

Figures (10)

Figure 1: Transformer layer
Figure 2: An illustration of the different quantisation methods considered in this work: MiniFloat [sun2019hybrid] and Denormalised MiniFloat (DMF), Block MiniFloat (BM) [fox2021block], Block Floating-Point (BFP) [darvish2020pushing] and Block Logarithm (BL).
Figure 3: The bit-width distribution of $\mathbf{Q}$ (Line 6 of the Transformer layer algorithm) from 2688 searches. We identify the layers less tolerant to aggressive quantisation in OPT-2.7B. For example, layers 18, 25 and 30 often need more bits than other layers. Keeping these layers in relatively high precision recovers the accuracy from 36.2% to 61.3% without decreasing the memory density, equivalent to a 4.3-bit OPT-2.7B on average.
Figure 4: Transformer layer (Vicuna)
Figure 5: An analysis similar to the motivation figure in the Introduction, showing variance against layer ID for OPT-350M (left) and OPT-2.7B (right). The trend of increasing activation variance is more pronounced in larger models.
...and 5 more figures
