Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao

TL;DR

This work explores the statistical and learning properties of the LLM layer, attributes the bottleneck of LLM quantisation to numerical scaling offsets, and adapts block quantisation, a family of methods that share scaling factors across packed numbers, to LLMs.

Abstract

The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19\times$ higher arithmetic density and a $5\times$ higher memory density than the float32 baseline, surpassing the prior-art 8-bit quantisation by $2.5\times$ in arithmetic density and $1.2\times$ in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced.
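To make "sharing scaling factors across packed numbers" concrete, the following is a minimal sketch of one member of the family, block floating-point, written in plain Python. It is an illustration under stated assumptions, not the authors' released code: the function names `bfp_quantise` and `bits_per_element`, the block size of 16, the 5 magnitude bits, and the 8-bit shared exponent are all hypothetical choices made for this example.

```python
import math


def bfp_quantise(values, block_size=16, mantissa_bits=5):
    """Fake-quantise a flat list of floats, one shared exponent per block.

    Each element keeps a sign plus `mantissa_bits` magnitude bits, while the
    exponent is stored once per block. The defaults are illustrative
    assumptions, not a configuration taken from the paper.
    """
    max_level = (1 << mantissa_bits) - 1
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        if max_abs == 0.0:
            # An all-zero block needs no scaling.
            out.extend(0.0 for _ in block)
            continue
        # frexp gives the smallest exponent e such that max_abs < 2**e.
        _, shared_exp = math.frexp(max_abs)
        # A single quantisation step is shared by the whole block.
        step = 2.0 ** (shared_exp - mantissa_bits)
        for v in block:
            level = round(v / step)
            # Clamp to the representable signed range.
            level = max(-max_level, min(max_level, level))
            out.append(level * step)
    return out


def bits_per_element(block_size=16, mantissa_bits=5, exponent_bits=8):
    """Average storage cost: sign + magnitude bits + amortised exponent."""
    return 1 + mantissa_bits + exponent_bits / block_size


if __name__ == "__main__":
    data = [0.5, -1.0, 0.03, 1.99]
    print(bfp_quantise(data, block_size=4))
    print(bits_per_element())
```

Because the step size is derived from the largest magnitude in each block, small values that share a block with a large one lose precision (here `0.03` rounds to zero next to `1.99`). Smaller blocks reduce this effect at the cost of storing more exponents. With the assumed values above, storage comes to 6.5 bits per element, since the 8-bit exponent is amortised over 16 elements.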

Paper Structure (42 sections, 4 equations, 10 figures, 8 tables, 2 algorithms)

Figures (10)

  • Figure 1: Transformer layer
  • Figure 2: An illustration of different quantisation methods considered in this work: MiniFloat [sun2019hybrid] and Denormed MiniFloat (DMF), Block MiniFloat (BM) [fox2021block], Block Floating-Point (BFP) [darvish2020pushing] and Block Logarithm (BL).
  • Figure 3: The bit-width distribution of $\mathbf{Q}$ (Line 6 of Algorithm \ref{alg:transformer}) from 2688 searches. We identify the layers less tolerant to aggressive quantisation in OPT-2.7B. For example, layers 18, 25 and 30 often need more bits than other layers. Keeping these layers in relatively high precision recovers the accuracy from 36.2% to 61.3% without decreasing the memory density, equivalent to a 4.3-bit OPT-2.7B on average.
  • Figure 4: Transformer layer (Vicuna)
  • Figure 5: An analysis similar to Figure \ref{fig:introduction:motivation}, showing variance vs. layer ID for OPT-350M (left) and OPT-2.7B (right). The trend of increasing activation variance is more pronounced in larger models.
  • ...and 5 more figures