ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

Gamze İslamoğlu; Moritz Scherer; Gianna Paulin; Tim Fischer; Victor J. B. Jung; Angelo Garofalo; Luca Benini

TL;DR

This work presents ITA, an energy-efficient transformer accelerator for embedded inference that uses 8-bit quantization and a streaming, integer-only softmax to minimize data movement. By adopting a weight-stationary dataflow and a tile-based architecture with wide dot-product units, ITA achieves competitive energy and area efficiency in 22 nm FD-SOI at 0.8 V. A hardware-friendly softmax that operates directly on quantized values in a streaming fashion is central to reducing memory traffic and power, with a mean absolute error (MAE) comparable to floating-point baselines. Evaluation shows ITA delivering a standalone energy efficiency of 16.9 TOPS/W and an area efficiency of 5.93 TOPS/mm$^2$, outperforming many peers and approaching Nvidia’s best in energy efficiency while surpassing it in area efficiency, which indicates strong potential for embedded transformer deployment. The results highlight the practicality of integer quantization and streaming softmax for efficient, real-time transformer inference on resource-constrained platforms.

Abstract

Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, the efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves competitive energy efficiency with respect to state-of-the-art transformer accelerators with 16.9 TOPS/W, while outperforming them in area efficiency with 5.93 TOPS/mm$^2$ in 22 nm fully-depleted silicon-on-insulator technology at 0.8 V.
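To make the idea of an integer-only softmax computed "on-the-fly in streaming mode" more concrete, the following is a minimal sketch under stated assumptions, not ITA's actual datapath. It assumes the quantization scale is chosen so that exponentials can be approximated by powers of two (a right shift), and it processes a row of int8 logits chunk by chunk while tracking a running maximum. The names `FRAC_BITS`, `SHIFT`, `streaming_denominator`, and `normalize` are hypothetical and do not come from the paper.

```python
from typing import Iterable, List, Sequence, Tuple

FRAC_BITS = 8  # hypothetical: fixed-point weight of the largest term (2**8)
SHIFT = 5      # hypothetical: a logit difference of 2**5 halves the term


def streaming_denominator(chunks: Iterable[Sequence[int]]) -> Tuple[int, int]:
    """Consume int8 logits chunk by chunk; return (running_max, denominator).

    Only integer adds, compares, and shifts are used. When a chunk raises
    the running maximum, the partial sum is rescaled by a right shift
    instead of being recomputed, so earlier logits never need to be re-read.
    The floor in the shift makes this an approximation in general.
    """
    running_max = -128
    denom = 0
    for chunk in chunks:
        chunk_max = max(chunk)
        if chunk_max > running_max:
            # Rescale what has been accumulated so far to the new maximum.
            denom >>= (chunk_max - running_max) >> SHIFT
            running_max = chunk_max
        for x in chunk:
            # 2**(-(max - x) / 2**SHIFT), approximated with a right shift.
            denom += (1 << FRAC_BITS) >> ((running_max - x) >> SHIFT)
    return running_max, denom


def normalize(row: Sequence[int], running_max: int, denom: int) -> List[int]:
    """Second pass: map each logit to an unsigned 8-bit probability."""
    return [
        (((1 << FRAC_BITS) >> ((running_max - x) >> SHIFT)) * 255) // denom
        for x in row
    ]
```

In this sketch, feeding the row in two chunks or in one chunk yields the same denominator when the logit differences are multiples of `2**SHIFT`; for other inputs the shift-based rescaling introduces a small rounding error, which is the kind of effect an MAE comparison against a floating-point softmax would quantify.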

Paper Structure (14 sections, 5 equations, 6 figures, 1 table)

Introduction
Preliminaries and Related Work
Transformers
Softmax
Related Work
Architecture
Softmax
Evaluation
Physical Implementation and Measurements
Experimental Results
Softmax
Performance Evaluation
Comparison to State-of-the-Art
Conclusion

Figures (6)

Figure 1: Transformer encoder and multi-head attention. S: sequence length, E: embedding size, P: projection space, H: number of heads.
Figure 2: Architecture of ITA with 8-bit inputs and weights. The softmax block is detailed in Figure 4.
Figure 3: Workload mapping and computation phases.
Figure 4: Softmax implementation. Buffers are shown in blue.
Figure 5: Effect of softmax and quantization on attention probabilities.
...and 1 more figure
