Error Diffusion: Post Training Quantization with Block-Scaled Number Formats for Neural Networks

Alireza Khodamoradi, Kristof Denolf, Eric Dellinger

TL;DR

This paper describes how to improve the quantization process by viewing the neural model as a composite function and diffusing the quantization error in every layer. It also introduces TensorCast, an open-source library based on PyTorch that emulates a variety of number formats, including block-scaled ones, to support research on neural model quantization.

Abstract

Quantization reduces a model's hardware costs, such as data movement, storage, and operations like multiplication and addition. It also changes the model's behavior by degrading output quality, so methods are needed that preserve that behavior when model parameters are quantized. More exotic numerical encodings, such as block-scaled number formats, have shown advantages in using a fixed bit budget to encode model parameters. This paper presents error diffusion (ED), a hyperparameter-free method for post-training quantization with support for block-scaled data formats. Our approach does not rely on backpropagation or Hessian information. We describe how to improve the quantization process by viewing the neural model as a composite function and diffusing the quantization error in every layer. In addition, we introduce TensorCast, an open-source library based on PyTorch that emulates a variety of number formats, including block-scaled ones, to support research on neural model quantization. We demonstrate the efficacy of our algorithm through rigorous testing on various architectures, including vision models and large language models (LLMs), where it consistently delivers competitive results. Our experiments confirm that block-scaled data formats are a robust choice for post-training quantization and could be used effectively to enhance the practical deployment of advanced neural networks.
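
As a rough illustration of how a block-scaled format spends a fixed bit budget, the sketch below quantizes one block of values to small signed integers that share a single power-of-two scale. It is neither TensorCast's API nor the ED algorithm. The names `quantize_block` and `dequantize_block`, the power-of-two scale, and the round-to-nearest rule are all assumptions made for this example.

```python
import math


def quantize_block(values, mantissa_bits=2):
    """Quantize one block to sign + mantissa codes that share a power-of-two scale.

    Hypothetical helper for illustration only. The shared exponent is returned
    as a plain integer; packing it into a fixed-width scale field (and clamping
    it to that field's range) is deliberately left out.
    """
    qmax = (1 << mantissa_bits) - 1  # largest code magnitude, 3 for 2 mantissa bits
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent whose scale lets the largest magnitude fit within qmax.
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    # Python's round() rounds halves to even; the clamp guards against
    # floating-point edge cases in the exponent computation.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exp, codes


def dequantize_block(exp, codes):
    """Map the integer codes back to real values using the shared scale."""
    scale = 2.0 ** exp
    return [c * scale for c in codes]


block = [0.1, -0.75, 0.3, 0.5]
exp, codes = quantize_block(block)
restored = dequantize_block(exp, codes)
```

In this example the largest magnitude, 0.75, sets the shared scale to 2^-2 and is reproduced exactly, while 0.1 collapses to 0 and 0.3 becomes 0.25. The block maximum dictates the resolution available to every other value in the block.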

Paper Structure

This paper contains 12 sections, 15 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Model $\Phi=f^{(4)}\circ f^{(3)}\circ(f^{(2)}\circ f^{(1)}, f^{(1)})$ as a composite function with dependencies described by its DAG. Each function's connections are color-coded to show its inputs and output.
  • Figure 2: Bit layout for four values encoded in int4 (sign: 1 bit and mantissa: 3 bits), fp4 (sign: 1 bit, exponent: 2 bits, and mantissa: 1 bit), and b4int3 (sign: 1 bit, mantissa: 2 bits, and 4-bit scale).
  • Figure 3: Top left: output $O_{M\times OFM}$, generated by the matrix multiplication of the input activations and weights. Top middle: each element $o_{i,j}$ of the output is the sum of products of two vectors of size $1\times IFM$. Bottom middle: the $k$th portion of the output matrix is the outer product of two vectors of sizes $M\times 1$ and $1\times OFM$. Bottom left: the output is the sum of its $IFM$ portions. Top right: a block of $s$ numbers in the $(W_{j,:})^T$ column. Bottom right: the same block of $s$ numbers expands over $s$ rows of $(W_{:,k:k+s})^T$ and affects $s$ output portions.
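
The two views of the matrix multiplication in the Figure 3 caption can be checked with a small pure-Python sketch. It assumes the activations `A` have shape $M\times IFM$ and the weights `W` have shape $OFM\times IFM$, so that $o_{i,j}$ is the inner product of row $i$ of `A` with row $j$ of `W`. The function names are hypothetical and chosen for this example only.

```python
def matmul_inner(A, W):
    """Inner-product view: o[i][j] = sum over k of A[i][k] * W[j][k]."""
    return [[sum(a * w for a, w in zip(a_row, w_row)) for w_row in W] for a_row in A]


def output_portion(A, W, k):
    """k-th output portion: outer product of column k of A (M x 1)
    with column k of W laid out as a row (1 x OFM)."""
    return [[a_row[k] * w_row[k] for w_row in W] for a_row in A]


def matmul_outer(A, W):
    """Outer-product view: the output is the sum of its IFM portions."""
    M, OFM, IFM = len(A), len(W), len(A[0])
    out = [[0] * OFM for _ in range(M)]
    for k in range(IFM):
        part = output_portion(A, W, k)
        for i in range(M):
            for j in range(OFM):
                out[i][j] += part[i][j]
    return out


# Small integer example: M = 2, IFM = 3, OFM = 4.
A = [[1, 2, 3],
     [4, 5, 6]]
W = [[1, 0, 2],
     [0, 1, 1],
     [3, 1, 0],
     [2, 2, 2]]

O_inner = matmul_inner(A, W)
O_outer = matmul_outer(A, W)
```

Both functions return the same matrix. In the outer-product view, the weights in column $k$ of `W` touch only portion $k$, so a block of $s$ consecutive weights taken along the $IFM$ axis of one row of `W` contributes to $s$ different portions, which is the effect the bottom-right panel describes.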