IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers
Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh
TL;DR
This work targets GEMMs, the dominant compute cost in Transformers, by asking whether integer GEMMs can match floating-point performance for both training and inference, and whether they can be computed exactly at arbitrarily low precision. It shows that while most matrix entries fit into low-bit representations, a few heavy hitters prevent the exclusive use of low-bit GEMMs. This motivates IM-Unpack, which unpacks a matrix with large integer entries into a larger matrix whose entries all lie in the low-bit range, while preserving the exact result after appropriate scaling. IM-Unpack combines row, column, and mixed unpacking strategies to enable exact GEMMs using only low-bit arithmetic, with overheads that are typically modest across common Transformer models. Empirically, round-to-nearest (RTN) quantization with sensible β values often comes close to FP performance on several models, and IM-Unpack provides a practical path to hardware that supports a single low bit-width while handling outliers implicitly, potentially enabling more power-efficient training and inference.
Abstract
GEneral Matrix Multiply (GEMM) is a central operation in deep learning and corresponds to the largest chunk of the compute footprint. Therefore, improving its efficiency is an active topic of ongoing research. A popular strategy is the use of low bit-width integers to approximate the original entries in a matrix. This allows efficiency gains, but often requires sophisticated techniques to control the rounding error incurred. In this work, we first verify that, when the low bit-width restriction is removed, integers are sufficient for all GEMMs needed, for both training and inference stages, across a variety of Transformer-based models, and can achieve parity with floating point counterparts. No sophisticated techniques are needed. We find that while a large majority of entries in matrices (encountered in such models) can be easily represented by low bit-width integers, the existence of a few heavy hitter entries makes it difficult to achieve efficiency gains via the exclusive use of low bit-width GEMMs. To address this issue, we develop a simple algorithm, Integer Matrix Unpacking (IM-Unpack), to unpack a matrix with large integer entries into a larger matrix whose entries all lie within the representable range of arbitrarily low bit-width integers. This allows equivalence with the original GEMM, i.e., the exact result can be obtained using purely low bit-width integer GEMMs. This comes at the cost of additional operations, but we show that for many popular models, this overhead is quite small.
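The unpacking idea can be illustrated with a minimal sketch in plain Python. This is not the paper's IM-Unpack algorithm: it is a simplified form of row unpacking that splits each out-of-range entry into signed base-s digits, and it assumes the second matrix B is already within the low-bit range. All function names (`split_digits`, `unpack_rows`, `unpacked_matmul`) are hypothetical, and the plain `matmul` helper only stands in for a low bit-width GEMM kernel.

```python
def split_digits(v, s):
    """Split integer v into signed base-s digits, least significant first.

    Every digit d satisfies |d| < s, and v == sum(d * s**k for k, d in enumerate(digits)).
    Requires s >= 2.
    """
    sign = -1 if v < 0 else 1
    v = abs(v)
    digits = []
    while True:
        digits.append(sign * (v % s))
        v //= s
        if v == 0:
            return digits


def unpack_rows(A, bits):
    """Unpack the rows of A so that every entry fits in `bits` signed bits.

    Returns (A_u, owners, scales): row j of A_u belongs to row owners[j] of A
    and must be weighted by scales[j] when results are recombined.
    Assumes bits >= 2 (illustrative choice, not taken from the paper).
    """
    s = 2 ** (bits - 1)  # entries are kept in [-(s - 1), s - 1]
    A_u, owners, scales = [], [], []
    for i, row in enumerate(A):
        digit_lists = [split_digits(v, s) for v in row]
        depth = max(len(d) for d in digit_lists)  # 1 if the row is already in range
        for k in range(depth):
            A_u.append([d[k] if k < len(d) else 0 for d in digit_lists])
            owners.append(i)
            scales.append(s ** k)
    return A_u, owners, scales


def matmul(X, Y):
    """Plain integer matrix product, standing in for a low bit-width GEMM."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def unpacked_matmul(A, B, bits):
    """Compute A @ B exactly using a GEMM whose left operand has only low-bit entries."""
    A_u, owners, scales = unpack_rows(A, bits)
    C_u = matmul(A_u, B)  # the only GEMM; all entries of A_u are in range
    C = [[0] * len(B[0]) for _ in A]
    for row, i, scale in zip(C_u, owners, scales):
        for j, v in enumerate(row):
            C[i][j] += scale * v  # scaling is applied after the GEMM
    return C


# Two heavy hitters (117 and -300) among otherwise small entries.
A = [[3, -2, 117], [1, 0, -5], [-300, 4, 2]]
B = [[1, -3], [2, 7], [-6, 5]]
assert unpacked_matmul(A, B, bits=4) == matmul(A, B)
```

In this toy case, with 4-bit signed entries, the two rows containing a heavy hitter each expand into three rows while the in-range row is left alone, so A grows from 3 to 7 rows. The extra rows are the overhead paid for exactness in this simplified scheme.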
