An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

Ashkan Moradifirouzabadi, Divya Sri Dodla, Mingu Kang

TL;DR

This paper presents an analog and digital hybrid processor to accelerate the attention mechanism for transformers in 65 nm CMOS technology with an analog computing-in-memory (CIM) core, which prunes 75% of low-score tokens on average during runtime at ultra-low power and delay.

Abstract

The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computational complexity and frequent memory accesses of self-attention place a heavy burden on the system, especially as the sequence length increases. This paper presents an analog and digital hybrid processor to accelerate the attention mechanism for transformers in 65 nm CMOS technology. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for the ~25% of unpruned tokens selected by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiency of 14.8 and 1.65 TOPS/W, and peak area efficiency of 976.6 and 79.4 GOPS/mm$^2$ in the analog core and the system-on-chip (SoC), respectively.
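
The two-stage flow described in the abstract (a cheap approximate pass that discards low-score tokens, followed by a precise pass over the survivors) can be illustrated in software. The following is a minimal sketch, not the chip's circuit: the 4-bit quantization stands in for the low-precision analog estimate, the threshold of 0 is borrowed from the comparator setting mentioned in the Figure 5 caption, and the names `quantize` and `hybrid_attention` are hypothetical. The fraction of tokens pruned depends on the data and will not generally be 75%.

```python
import math


def quantize(vec, bits=4):
    """Coarse symmetric quantization (assumed stand-in for an analog estimate)."""
    peak = max(abs(x) for x in vec) or 1.0  # avoid dividing by zero
    levels = 2 ** (bits - 1) - 1
    return [round(x / peak * levels) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hybrid_attention(q, keys, values, threshold=0.0, bits=4):
    """Prune keys using approximate scores, then attend precisely to the rest."""
    # Stage 1 (approximate): low-precision q.k estimate per key.
    q_lo = quantize(q, bits)
    approx = [dot(q_lo, quantize(k, bits)) for k in keys]
    kept = [i for i, s in enumerate(approx) if s > threshold]
    if not kept:
        # Assumption: keep the best-scoring token so the softmax is defined.
        kept = [max(range(len(keys)), key=lambda i: approx[i])]

    # Stage 2 (precise): scaled dot-product attention over kept tokens only.
    scale = 1.0 / math.sqrt(len(q))
    exact = [dot(q, keys[i]) * scale for i in kept]
    top = max(exact)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in exact]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, kept)) for d in range(dim)]
    return out, kept
```

In this sketch the precise stage touches only the kept indices, so its cost scales with the number of unpruned tokens rather than with the full sequence length.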

Paper Structure (7 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Overview of the token pruning mechanism using analog in-memory computing and hybrid digital processing.
  • Figure 2: Overall architecture and dataflow of the design.
  • Figure 3: Transposable memory array. (a) The array architecture supporting CIM and standard read operations with the proposed 9-T bitcell, (b) The timing diagram of CIM operation.
  • Figure 4: Bitline processor (BLP) for binary-weighted summation with a timing diagram.
  • Figure 5: Pruning accuracy of the CIM core. (a) Pruning decision map of the comparator with different $q$ and $k$, and a threshold of 0. (b) SSCS circuitry. (c) The effect of incorporating SSCS on the pruning accuracy with different input sparsities.
  • ...and 4 more figures