IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

Zhanpeng Zeng; Karthikeyan Sankaralingam; Vikas Singh

TL;DR

This work tackles the dominant compute cost of GEMMs in Transformers by asking whether integer GEMMs can match floating-point performance for both training and inference. It shows that while most matrix entries fit into low-bit representations, a handful of heavy hitters prevent the exclusive use of low-bit GEMMs. This motivates IM-Unpack, which unpacks a matrix with large integer entries into a larger matrix whose entries all lie in the low-bit range, while still yielding the exact result after appropriate scaling. IM-Unpack combines row, column, and mixed unpacking strategies to enable exact GEMMs using only low-bit arithmetic, with overheads that are typically modest across common Transformer models. Empirically, round-to-nearest (RTN) quantization with sensible β values often comes close to floating-point performance on several models, and IM-Unpack provides a practical path to hardware that supports a single low bit-width while handling outliers implicitly, potentially enabling more power-efficient training and inference.

Abstract

GEneral Matrix Multiply (GEMM) is a central operation in deep learning and accounts for the largest share of the compute footprint. Improving its efficiency is therefore an active topic of ongoing research. A popular strategy is to use low bit-width integers to approximate the original entries of a matrix. This yields efficiency gains, but often requires sophisticated techniques to control the rounding error incurred. In this work, we first verify that, when the low bit-width restriction is removed, integers are sufficient for all GEMMs in a variety of Transformer-based models, for both the training and inference stages, and can achieve parity with their floating-point counterparts. No sophisticated techniques are needed. We find that while a large majority of entries in the matrices encountered in such models can easily be represented by low bit-width integers, the existence of a few heavy-hitter entries makes it difficult to achieve efficiency gains through the exclusive use of low bit-width GEMMs. To address this issue, we develop a simple algorithm, Integer Matrix Unpacking (IM-Unpack), to unpack a matrix with large integer entries into a larger matrix whose entries all lie within the representable range of arbitrarily low bit-width integers. This yields equivalence with the original GEMM, i.e., the exact result can be obtained using purely low bit-width integer GEMMs. This comes at the cost of additional operations; we show that for many popular models, this overhead is quite small.
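The unpacking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm as published; it is a simplified row-only variant written under several assumptions: the function names (`unpack_rows`, `unpacked_matmul`) are hypothetical, only the left matrix A is unpacked, the right matrix B is assumed to already fit in the low-bit range, and entries are kept in the open range (-2^(bits-1), 2^(bits-1)). Each out-of-range row is split into a remainder row and a quotient row, so the original row equals the remainder plus the base times the quotient, and the exact product is recovered by scaling and summing the rows of the low-bit GEMM result.

```python
def matmul(A, B):
    """Plain integer matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def unpack_rows(A, bits):
    """Split rows of A until every entry x satisfies |x| < 2**(bits - 1).

    Returns (A_u, src, scale): row k of A_u contributes
    scale[k] * A_u[k] to row src[k] of the original A.
    Requires bits >= 2 so that the base is at least 2.
    """
    s = 2 ** (bits - 1)                 # base used for the digit split (assumption)
    A_u = [list(row) for row in A]
    src = list(range(len(A)))
    scale = [1] * len(A)
    k = 0
    while k < len(A_u):
        row = A_u[k]
        if any(abs(x) >= s for x in row):
            sign = [1 if x >= 0 else -1 for x in row]
            # x = rem + s * quo, with |rem| < s and signs preserved
            rem = [sg * (abs(x) % s) for sg, x in zip(sign, row)]
            quo = [sg * (abs(x) // s) for sg, x in zip(sign, row)]
            A_u[k] = rem
            A_u.append(quo)             # revisited later; may be split again
            src.append(src[k])
            scale.append(scale[k] * s)
        k += 1
    return A_u, src, scale


def unpacked_matmul(A, B, bits):
    """Compute A @ B exactly via a GEMM whose left operand is low-bit only."""
    A_u, src, scale = unpack_rows(A, bits)
    P = matmul(A_u, B)                  # the only GEMM; A_u entries are in range
    C = [[0] * len(B[0]) for _ in A]
    for k, prow in enumerate(P):
        for j, v in enumerate(prow):
            C[src[k]][j] += scale[k] * v    # scale is a power of two (a shift)
    return C


# Toy example: one heavy hitter (100) in an otherwise small matrix.
A = [[3, -2, 1], [100, 5, -7]]
B = [[1, 2], [0, -1], [3, 1]]
A_u, src, scale = unpack_rows(A, bits=4)
C = unpacked_matmul(A, B, bits=4)
```

In this toy case only the row containing the heavy hitter is split, so the unpacked matrix grows by two rows rather than doubling per extra digit, which matches the intuition that the overhead stays small when heavy hitters are rare.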

Paper Structure (17 sections, 18 equations, 9 figures, 17 tables, 5 algorithms)

Introduction
Round to Nearest: What do we lose?
Efficacy of Integers: Inference
Efficacy of Integers: Training
What happens with Low Bit-Width?
IM-Unpack: Integer Matrix Unpacking
Variants of Matrix Unpacking
Evaluating Unpacking Overhead
Limitations
Conclusion
Appendices
Why Using Percentiles?
Baseline Comparison when Quantize Parameters Only
Details of Training Experiments
Unpack Ratios of ViT-Large
...and 2 more sections

Figures (9)

Figure 1: Overall Illustration. We verify the Efficacy of Integers (Contribution 1) in §\ref{sec:integer}, but note that the integer matrices contain heavy hitters (§\ref{sec:issue_low_precision_int}). Then, we describe our proposed algorithm, IM-Unpack (Contribution 2), to resolve these heavy hitters in §\ref{sec:imunpack}.
Figure 2: Training: Comparison of RoBERTa loss curves.
Figure 3: Training: Comparison of ViT-Small. $^\dagger$ and $^*$: we set $\beta = 16383$ and $\beta = 1023$, respectively, for the set $\{\nabla_{\mathbf{Y}}, \nabla_{\mathbf{P}}, \nabla_{\mathbf{O}}\}$.
Figure 4: Illustration of unpacking row vectors. The solid, dashed, and dotted arrows correspond to lines 5, 4, and 6 in Algo. \ref{alg:unpack_row}.
Figure 5: Illustration of unpacking column vectors. The blue solid, dashed, and dotted arrows correspond to lines 5, 4, and 7 in Algo. \ref{alg:unpack_row}, and the gray dashed arrow corresponds to line 6 in Algo. \ref{alg:unpack_row}.
...and 4 more figures

Theorems & Definitions (1)

Remark 4.1
