EXAQ: Exponent Aware Quantization For LLMs Acceleration

Moran Shkolnik; Maxim Fishman; Brian Chmiel; Hilla Ben-Yaacov; Ron Banner; Kfir Yehuda Levy

TL;DR

This work proposes an analytical approach to determine the optimal clipping value for the input to the softmax function. This enables sub-4-bit quantization for LLM inference and allows, for the first time, an acceleration of approximately 4x in the accumulation phase.

Abstract

Quantization has established itself as the primary approach for decreasing the computational and storage expenses associated with Large Language Model (LLM) inference. The majority of current research emphasizes quantizing weights and activations to enable low-bit general-matrix-multiply (GEMM) operations, with the remaining non-linear operations executed at higher precision. In our study, we discovered that following the application of these techniques, the primary bottleneck in LLM inference lies in the softmax layer. The softmax operation comprises three phases: exponent calculation, accumulation, and normalization. Our work focuses on optimizing the first two phases. We propose an analytical approach to determine the optimal clipping value for the input to the softmax function, enabling sub-4-bit quantization for LLM inference. This method accelerates the calculations of both $e^x$ and $\sum(e^x)$ with minimal to no accuracy degradation. For example, in LLaMA1-30B, we achieve baseline performance with 2-bit quantization on the well-known "Physical Interaction: Question Answering" (PIQA) dataset evaluation. This ultra-low bit quantization allows, for the first time, an acceleration of approximately 4x in the accumulation phase. The combination of accelerating both $e^x$ and $\sum(e^x)$ results in a 36.9% acceleration in the softmax operation.
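The three softmax phases named in the abstract, applied to clipped and quantized inputs, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `quantized_softmax`, the uniform grid between the clipping value and 0, the per-level lookup table for $e^x$, and a hand-picked `clip` in place of the analytically derived optimum.

```python
import math

def quantized_softmax(x, clip, bits):
    """Softmax over inputs clipped to [clip, 0] and quantized to 2**bits levels.

    `clip` is a hypothetical, hand-picked negative threshold; this sketch does
    not reproduce the analytical rule for choosing it.
    """
    levels = 2 ** bits
    step = -clip / (levels - 1)  # spacing of the assumed uniform grid
    m = max(x)
    # Shift so the maximum is 0, clip from below at `clip`,
    # then round to the nearest grid index in 0..levels-1.
    idx = [round((max(v - m, clip) - clip) / step) for v in x]
    # Phase 1: exponent calculation, one table entry per quantization level.
    lut_exp = [math.exp(clip + k * step) for k in range(levels)]
    exps = [lut_exp[k] for k in idx]
    # Phase 2: accumulation of the denominator.
    denom = sum(exps)
    # Phase 3: normalization.
    return [e / denom for e in exps]

probs = quantized_softmax([0.0, -1.0, -2.0, -10.0], clip=-6.0, bits=2)
```

With `bits=2` only four distinct exponent values can occur, so the choice of `clip` decides how the error is split between inputs that are clipped and inputs that are rounded coarsely.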

Paper Structure (29 sections, 5 equations, 6 figures, 6 tables, 2 algorithms)

This paper contains 29 sections, 5 equations, 6 figures, 6 tables, 2 algorithms.

Introduction
Our paper introduces several key contributions:
Motivation
Exponent-Aware Quantization (EXAQ)
Problem Formulation
Algorithm implementation
Exponent calculation
Accelerated denominator accumulation
Experiments
Accuracy experiments
Experimental settings
Quantization settings.
Inference accuracy evaluation
Runtime experiments
Related work
...and 14 more sections

Figures (6)

Figure 1: Distribution of runtime consumption by the layer type. The chart illustrates the proportional runtime spent on various layer types during model execution, highlighting the significant computational burden imposed by the softmax layer, which accounts for $39\%$ of the total runtime.
Figure 2: Illustration of the distortion at the output of $e^x$ due to the quantization and clipping of the inputs. The clipping value $C$ is the threshold we aim to optimize. A very negative $C$ reduces clipping error but increases quantization error. The total mean squared error is the sum of these two contributions.
Figure 3: Optimal clipping value vs. standard deviation of softmax input for different bit widths. The analysis and simulation results agree, demonstrating the accuracy of the analytical model. The simulation was conducted by generating 1000 samples from a normal distribution with mean 0 and various standard deviations.
Figure 4: Original softmax algorithm
Figure 5: Illustration of the proposed accelerated denominator accumulation. The $LUT_{sum}$ lookup table contains pre-computed values of sums of the exponents of 4 consecutive quantized tensor elements. In the left example, the integer representations of the quantized values are $X_q[0:4]=[0,3,0,3]$, and their corresponding floating-point representations are $[q[0], q[3], q[0], q[3]]$. The lookup key is constructed by concatenating the 2-bit counterparts of the 4 integer representations into a single byte.
...and 1 more figure
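The lookup scheme described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the helper names `build_lut_sum`, `pack4`, and `accumulate`, the example quantization levels `q`, and the bit order used when packing (first element in the most significant bits) are all assumptions.

```python
import math

def build_lut_sum(q):
    """Build a 256-entry table: entry `key` holds the sum of exp(q[c]) over
    the four 2-bit codes packed into the byte `key`."""
    exp_q = [math.exp(v) for v in q]
    lut = []
    for key in range(256):
        codes = [(key >> shift) & 0b11 for shift in (6, 4, 2, 0)]
        lut.append(sum(exp_q[c] for c in codes))
    return lut

def pack4(codes):
    """Concatenate four 2-bit codes into one byte (assumed order: first code
    in the top two bits)."""
    key = 0
    for c in codes:
        key = (key << 2) | (c & 0b11)
    return key

def accumulate(x_q, lut):
    """Sum the exponents of a quantized vector, four elements per lookup.
    Assumes len(x_q) is a multiple of 4."""
    return sum(lut[pack4(x_q[i:i + 4])] for i in range(0, len(x_q), 4))

q = [-6.0, -4.0, -2.0, 0.0]   # hypothetical dequantized values q[0]..q[3]
lut_sum = build_lut_sum(q)
key = pack4([0, 3, 0, 3])     # the caption's left example, X_q[0:4]
denominator_part = lut_sum[key]
```

Under this reading, each lookup replaces four separate exponent-and-add steps, which is consistent with the approximately 4x acceleration of the accumulation phase stated in the abstract.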
