Topkima-Former: Low-energy, Low-Latency Inference for Transformers using top-k In-memory ADC

Shuai Dong, Junyi Yang, Xiaoqi Peng, Hongyang Shang, Ye Ke, Xiaofeng Yang, Hongjie Liu, Arindam Basu

TL;DR

This work proposes innovations at the circuit, algorithm, and architecture levels to accelerate transformer inference, and introduces a fine-grained pipeline for scheduling data flows efficiently and an improved scale-free technique that removes the scaling cost.

Abstract

The transformer has gained prominence as a popular deep neural network architecture for natural language processing (NLP) and computer vision (CV) applications. However, its extensive use of nonlinear operations such as softmax creates a performance bottleneck during inference, accounting for up to 40% of the total latency. Hence, we propose innovations at the circuit, architecture, and algorithm levels to accelerate the transformer. At the circuit level, we propose topkima, which combines top-k activation selection with an in-memory ADC (IMA) to implement a low-energy, low-latency softmax without any sorting latency. Only the k largest activations are sent to the softmax calculation block, reducing the large computational cost of softmax. Using a modified training scheme that applies top-k only in the forward pass, experimental results show an accuracy reduction of only 0.4% to 1.2% across ViT, DistilBERT, and BERT-base models evaluated on the CIFAR-10, CIFAR-100, and SQuAD datasets with k=5. At the architecture level, an improved scale-free technique is introduced to reduce the computational cost of attention. The combined system, dubbed Topkima-Former, achieves 1.8x-84x speedup and 1.3x-35x higher energy efficiency (EE) over prior in-memory computing (IMC) accelerators. Compared to a conventional softmax macro and a digital top-k (Dtopk) softmax macro, the proposed topkima softmax macro is about 15x and 8x faster, respectively.
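
The idea of sending only the k largest activations to the softmax block can be illustrated in software. The following is a minimal sketch, not the authors' circuit: the function name `topk_softmax` and the example scores are hypothetical, and the sketch finds the top-k entries by sorting, which is precisely the step the in-memory ADC is described as avoiding in hardware.

```python
import math

def topk_softmax(scores, k):
    """Softmax over only the k largest scores; every other entry gets probability 0.

    Hypothetical software illustration of top-k softmax (assumed helper name).
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # Indices of the k largest scores. Sorting is used here for clarity only;
    # the hardware described in the abstract selects them without sorting.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Subtract the maximum selected score for numerical stability.
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

# Example: one row of attention scores, keeping the 2 largest entries.
probs = topk_softmax([2.0, -1.0, 0.5, 3.0, 0.0], k=2)
```

With k much smaller than the sequence length (the abstract reports k=5), only k exponentials and one k-term normalization are computed per row, which is where the reduction in softmax cost comes from.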

Paper Structure

This paper contains 12 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The attention module
  • Figure 2: Topkima-M hardware: (a) Block diagram. (b) Concept of early stopping in topkima. (c) Circuit diagram detail of one column. (d) Basic multiplication of proposed dual 10T SRAM cell. (e) Example timing diagram for 3 columns with top-1 selected.
  • Figure 3: Accuracy evaluation of top-$k$
  • Figure 4: Hardware evaluation results. (a) Latency breakdown across Conv-SM, Dtopk-SM and topkima-SM. (b) Theoretical and simulated MAC value. (c) Impact of sub-top-$k$. (d) Different scale implementations. (e) Latency and (f) energy breakdown by components. (g) Latency and (h) energy breakdown by operations.