IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh

TL;DR

This work tackles the dominant compute cost of GEMMs in Transformers by asking whether exact integer GEMMs at arbitrarily low precision can match floating-point performance for both training and inference. It shows that most matrix entries fit into low-bit representations, but a handful of heavy hitters prevent the use of purely low-bit GEMMs. This motivates IM-Unpack, which unpacks a matrix with large integer entries into a larger matrix whose entries all lie in the low-bit range, while still yielding the exact result after appropriate scaling. IM-Unpack combines row, column, and mixed unpacking strategies to enable exact GEMMs using only low-bit arithmetic, with overheads that are typically modest across common Transformer models. Empirically, round-to-nearest (RTN) quantization with sensible β values often comes close to FP performance on several models, and IM-Unpack provides a practical path to hardware that supports a single low bit-width while handling outliers implicitly, potentially enabling more power-efficient training and inference.

Abstract

GEneral Matrix Multiply (GEMM) is a central operation in deep learning and accounts for the largest share of the compute footprint. Improving its efficiency is therefore an active topic of ongoing research. A popular strategy is to use low bit-width integers to approximate the original entries of a matrix. This allows efficiency gains, but often requires sophisticated techniques to control the rounding error incurred. In this work, we first check whether, once the low bit-width restriction is removed, integers are sufficient for all GEMMs in a variety of Transformer-based models, for both the training and inference stages, and whether they can achieve parity with their floating-point counterparts. No sophisticated techniques are needed. We find that while a large majority of entries in the matrices encountered in such models can be easily represented by low bit-width integers, the existence of a few heavy-hitter entries makes it difficult to achieve efficiency gains through the exclusive use of low bit-width GEMMs. To address this issue, we develop a simple algorithm, Integer Matrix Unpacking (IM-Unpack), to unpack a matrix with large integer entries into a larger matrix whose entries all lie within the representable range of arbitrarily low bit-width integers. This allows equivalence with the original GEMM, i.e., the exact result can be obtained using purely low bit-width integer GEMMs. This comes at the cost of additional operations; we show that for many popular models, this overhead is quite small.
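
The unpacking idea in the abstract can be illustrated with a simplified digit-decomposition sketch in NumPy. This is not the authors' row, column, or mixed unpacking procedure. It assumes only the left matrix contains heavy hitters, that the representable range for a `bits`-wide integer is the open interval (-s, s) with s = 2^(bits-1), and the function names `unpack_rows` and `low_bit_matmul` are hypothetical. Each row with out-of-range entries is split into several in-range rows, a single GEMM is run on the larger matrix, and the partial rows are recombined with power-of-two scales.

```python
import numpy as np

def unpack_rows(A, bits):
    """Split rows of an integer matrix until every entry lies in (-s, s), s = 2**(bits-1).

    Returns (U, src, scale): row i of U contributes scale[i] * U[i] to row src[i] of A.
    """
    s = 2 ** (bits - 1)
    rows, src, scale = [], [], []
    for i, row in enumerate(A):
        r, w = row.astype(np.int64), 1
        while True:
            # np.fmod keeps the sign of the dividend, so each digit lies in (-s, s).
            digit = np.fmod(r, s)
            rows.append(digit)
            src.append(i)
            scale.append(w)
            r = (r - digit) // s   # exact division: r - digit is a multiple of s
            w *= s
            if not r.any():        # nothing left to unpack for this row
                break
    return np.array(rows), np.array(src), np.array(scale)

def low_bit_matmul(A, B, bits):
    """Compute A @ B exactly using one GEMM whose left operand has low-bit entries.

    Assumption for this sketch: B already fits in the low-bit range.
    """
    U, src, scale = unpack_rows(A, bits)
    P = U @ B                                   # GEMM on the unpacked (taller) matrix
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    np.add.at(C, src, scale[:, None] * P)       # scaled accumulation of partial rows
    return C

rng = np.random.default_rng(0)
A = rng.integers(-7, 8, size=(6, 5))
A[2, 3], A[4, 0] = 1000, -345                   # two heavy hitters
B = rng.integers(-7, 8, size=(5, 3))

U, src, scale = unpack_rows(A, bits=4)
assert np.abs(U).max() < 8                      # every unpacked entry fits in 4 bits
assert np.array_equal(low_bit_matmul(A, B, bits=4), A @ B)
```

In this sketch only the rows that contain heavy hitters grow, so the extra work scales with the number and magnitude of outliers rather than with the size of the matrix.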

Paper Structure (17 sections, 18 equations, 9 figures, 17 tables, 5 algorithms)

Figures (9)

  • Figure 1: Overall illustration. We verify the efficacy of integers (Contribution 1) in §\ref{sec:integer}, but note that the integer matrices contain heavy hitters (§\ref{sec:issue_low_precision_int}). Then, we describe our proposed algorithm, IM-Unpack (Contribution 2), to resolve these heavy hitters in §\ref{sec:imunpack}.
  • Figure 2: Training: Comparison of RoBERTa loss curves.
  • Figure 3: Training: Comparison of ViT-Small. $^\dagger$ and $^*$: we set $\beta = 16383$ and $\beta = 1023$, respectively, for the set $\{\nabla_{\mathbf{Y}}, \nabla_{\mathbf{P}}, \nabla_{\mathbf{O}}\}$.
  • Figure 4: Illustration of unpacking row vectors. The solid, dashed, and dotted arrows correspond to lines 5, 4, and 6 in Algo. \ref{alg:unpack_row}.
  • Figure 5: Illustration of unpacking column vectors. The blue solid, dashed, and dotted arrows correspond to lines 5, 4, and 7 in Algo. \ref{alg:unpack_row}, and the gray dashed arrow corresponds to line 6 in Algo. \ref{alg:unpack_row}.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Remark 4.1