Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Dimitrios Danopoulos, Enrico Lupi, Michael Kagan, Maurizio Pierini

Abstract

Softmax can become a computational bottleneck in the Transformer's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. To address this, we propose Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate for the exponential softmax that applies a clipped linear mapping to the max-centered attention logits. The approximation yields a stable, non-negative probability distribution and preserves the ordering of the original logits. HCCS differs from previous softmax surrogates in that it includes a set of lightweight calibration parameters, optimized offline on a representative dataset and fitted per attention head, so that each head's statistical properties are preserved. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. AMD's current reference implementations for this platform rely on either bfloat16 arithmetic or lookup tables (LUTs) for the exponential, which can limit throughput and leave the AI Engine's high-throughput integer vector processing units underutilized. In contrast, HCCS maps naturally to the AI Engines' int8 multiply-accumulate (MAC) units. To the best of our knowledge, this is the first int8-optimized softmax surrogate for the AMD AI Engines; it significantly exceeds the throughput of the reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.
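
For intuition, the following is a minimal floating-point sketch of a clipped-linear, max-centered softmax surrogate of the kind described above. The slope alpha and clipping range c stand in for HCCS's per-head calibrated parameters; the names and default values are illustrative assumptions, not the paper's.

    import numpy as np

    def clipped_linear_softmax(logits, alpha=1.0, c=8.0):
        # Illustrative clipped-linear softmax surrogate; alpha (slope) and c (clip range)
        # stand in for HCCS's per-head calibrated parameters and are not the paper's values.
        d = np.max(logits, axis=-1, keepdims=True) - logits   # max-centered distance, d >= 0
        s = np.clip(c - alpha * d, 0.0, c)                     # bounded, monotone, non-negative scores
        return s / np.sum(s, axis=-1, keepdims=True)           # normalize to a probability distribution

Because the row maximum always sits at the top of the ramp, the normalizer is strictly positive, and any logit more than c / alpha below the maximum receives exactly zero probability; this keeps the surrogate bounded, monotone, and non-negative as stated in the abstract.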

Figures (3)

  • Figure 1: Illustration of the softmax proxy pipeline on hardware. The five stages (vector max reduction, unsigned distance and clamp, affine score via int8 MAC, sum reduction, and reciprocal-based normalization) are all expressed as integer vector operations, with no floating-point or LUT-based exponential at any stage; an illustrative sketch of these stages follows this list.
  • Figure 2: Attention probability curves for broad and focused heads from BERT-Tiny (top) and selected BERT-Small heads (bottom), comparing float32 softmax and retrained HCCS. Differences in absolute probabilities are expected, especially on focused heads: HCCS is a surrogate rather than an exact softmax replica, and retraining adapts the model to a new but effective attention distribution.
  • Figure 3: Aggregate softmax throughput vs. number of AMD AI Engine tiles.
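
As a rough reading of the five stages listed for Figure 1, an integer-only emulation for a single attention row might look like the sketch below. The constants slope, clip, and shift are placeholder assumptions rather than the kernel's actual parameters, and the single integer division stands in for whatever reciprocal scheme the real kernel uses.

    import numpy as np

    def hccs_int8_row(q_logits, slope=4, clip=127, shift=15):
        # Integer-only emulation of the five pipeline stages in Figure 1 (illustrative only;
        # slope, clip, and shift are placeholders, not the kernel's calibrated parameters).
        x = q_logits.astype(np.int32)
        m = x.max()                          # 1) vector max reduction
        d = np.minimum(m - x, clip)          # 2) unsigned distance from the max, clamped to the int8 range
        s = np.maximum(clip - slope * d, 0)  # 3) affine score via integer MAC, floored at zero
        total = int(s.sum())                 # 4) sum reduction (the max element keeps total > 0)
        recip = (1 << shift) // total        # 5) fixed-point reciprocal of the sum
        return s * recip                     #    probabilities scaled by 2**shift

Every step uses only integer comparisons, additions, and multiplies over the row, which is what allows the score stage to map onto the AI Engine's int8 MAC units as the caption describes.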