Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory

Dong Eun Kim, Tanvi Sharma, Kaushik Roy

TL;DR

This work targets two transformer inference bottlenecks: slow softmax computation and an on-chip memory footprint that grows quadratically with sequence length $l$. It introduces HASTILY, a compute-in-memory (CIM) accelerator built around unified compute-and-lookup modules (UCLMs), which are composed of dual-functionality 8T-SRAM arrays. Through hardware-software co-design, HASTILY accelerates both softmax and matrix multiplication: fine-grained pipelining reduces memory pressure, a multi-core strategy parallelizes the exponent computations and reductions, and a tailored compiler maps BERT-like models onto the CIM architecture. Key contributions are the UCLM SRAM design, the LUT-based exponential engine, multi-core softmax reduction, and a fine-grained pipelining scheme that lowers the memory dependence on sequence length from $O(l^2)$ to near $O(l)$. For INT-8 BERT models, the evaluation reports end-to-end throughput gains of up to $9.8\times$ over an Nvidia A40 GPU and up to $5.9\times$ over baseline CIM hardware, along with a $16\times$ to $36\times$ improvement in energy efficiency (TOPS/W) over the A40 and energy efficiency similar to that of the baseline CIM hardware.
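To illustrate why processing attention one vector at a time reduces intermediate storage from $O(l^2)$ to $O(l)$, here is a minimal sketch. It is not HASTILY's scheduler: the function names `attention_row` and `streamed_attention` are hypothetical, and the code only shows that each query row needs $l$ live scores rather than the full $l \times l$ score matrix.

```python
import math

def attention_row(q, K, V):
    """Compute one attention output row; only len(K) scores are alive at once."""
    d = len(q)
    # Scaled dot-product scores of this query against every key: l values.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                           # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    out = [0.0] * len(V[0])
    for w, v in zip(exps, V):                 # weighted sum of value vectors
        for j, vj in enumerate(v):
            out[j] += (w / total) * vj
    return out

def streamed_attention(Q, K, V):
    """Yield output rows one by one; the l x l score matrix is never stored."""
    for q in Q:
        yield attention_row(q, K, V)
```

For example, `list(streamed_attention(Q, K, V))` produces the same rows as materializing the whole score matrix first, but the peak intermediate storage per step is one score vector of length $l$.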

Abstract

Transformers have become the backbone of neural network architectures for most machine learning applications. Their widespread use has prompted multiple efforts to accelerate attention, the basic building block of transformers. This paper tackles the challenges of accelerating attention through a hardware-software co-design approach that leverages a compute-in-memory (CIM) architecture. In particular, our energy- and area-efficient CIM-based accelerator, named HASTILY, aims to accelerate softmax computation, an integral operation in attention, and to minimize attention's high on-chip memory requirement, which grows quadratically with input sequence length. Our architecture consists of novel CIM units called unified compute and lookup modules (UCLMs) that integrate both lookup and multiply-accumulate functionality within the same SRAM array, incurring minimal area overhead over standard CIM arrays. Designed in TSMC 65nm, UCLMs can concurrently perform exponential and matrix-vector multiplication operations. Complementing the proposed architecture, HASTILY features a fine-grained pipelining strategy for scheduling both attention and feed-forward layers, which reduces the quadratic dependence on sequence length to a linear one. Further, fast softmax computation requires the maximum and the sum of the exponential values; these operations are parallelized across multiple cores using a reduce-and-gather strategy. We evaluate our proposed architecture using a compiler tailored towards attention computation and a standard cycle-level CIM simulator. Our evaluation shows end-to-end throughput (TOPS) improvements of 4.4x-9.8x and 1.7x-5.9x over the Nvidia A40 GPU and baseline CIM hardware, respectively, for BERT models with INT-8 precision. It also shows gains of 16x-36x in energy efficiency (TOPS/W) over the A40 GPU and energy efficiency similar to that of the baseline CIM hardware.
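The reduce-and-gather idea for softmax can be sketched as follows. This is an illustrative software model, not the paper's hardware protocol: each simulated "core" computes a local maximum and a local sum of exponentials over its slice, and the partial results are merged pairwise in a tree. The merge rule is the standard rescaling used in online softmax. The names `local_stats`, `merge`, `tree_reduce`, and `multicore_softmax` are hypothetical.

```python
import math

def local_stats(chunk):
    """Per-core work: local max and sum of exponentials relative to that max."""
    m = max(chunk)
    return m, sum(math.exp(v - m) for v in chunk)

def merge(a, b):
    """Combine two (max, sum) pairs by rescaling both sums to the larger max."""
    (m1, s1), (m2, s2) = a, b
    m = max(m1, m2)
    return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

def tree_reduce(stats):
    """Pairwise reduction; takes about log2(number of cores) rounds."""
    while len(stats) > 1:
        nxt = [merge(stats[i], stats[i + 1]) for i in range(0, len(stats) - 1, 2)]
        if len(stats) % 2:            # odd element passes through to next round
            nxt.append(stats[-1])
        stats = nxt
    return stats[0]

def multicore_softmax(scores, num_cores):
    """Softmax over a non-empty list, with statistics split across num_cores."""
    size = math.ceil(len(scores) / num_cores)
    chunks = [scores[i:i + size] for i in range(0, len(scores), size)]
    m, s = tree_reduce([local_stats(c) for c in chunks])
    return [math.exp(v - m) / s for v in scores]
```

Because only a `(max, sum)` pair travels between cores at each level, the amount of data exchanged is independent of the slice length.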

Paper Structure

This paper contains 32 sections, 3 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: (a) Illustration of a transformer block showing multi-head attention and feed-forward blocks [AttentionIsAllYouNeed]. (b) Runtime breakdown for different sequence lengths and an embedding size of 512, as measured on an Ampere (A40) GPU [Ampere]. (c) Dependence of softmax on attention computation, limiting the overall throughput of CIM-based hardware.
  • Figure 2: (a) Comparison of our work, HASTILY, with other works [retransformer, xformer, softmax1_hyft, softmax2_ITA] addressing the challenges associated with softmax computation. (b) Distinguishing the fine-grained pipelining technique proposed in this work from that of a prior work, ReTransformer [retransformer].
  • Figure 3: (a) The hierarchical spatial architecture of the proposed CIM accelerator, (b) hardware architecture of each core in HASTILY, (c) micro-architecture details of each UCLM consisting of multiple SRAM arrays, (d) physical layout of each dual-functionality 8T-SRAM array implemented in TSMC 65nm, and (e) depiction of the two operations, compute on the left and lookup on the right, in the UCLM and correspondingly in the SRAM.
  • Figure 4: (a) Each SRAM array stores a 128-entry lookup table for $2^{k/128}$, $1\leq k\leq 128$. (b) Illustration of parallel exponent operations executing in multiple UCLMs in a core of the HASTILY architecture.
  • Figure 5: Computing softmax by gathering all required vectors in a single core versus parallelizing the compute across multiple cores and gathering them in a tree-like fashion.
  • ...and 8 more figures
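
As a rough illustration of how the 128-entry table in Figure 4(a) could drive an exponential, the sketch below reads the table as holding $2^{k/128}$ for $1 \leq k \leq 128$ (an assumption about the caption) and evaluates $e^x$ for $x \leq 0$, which is the range that remains after the maximum is subtracted in softmax. The decomposition into one table lookup plus a power-of-two shift, and the name `lut_exp`, are assumptions for illustration and are not taken from the paper.

```python
import math

LUT_SIZE = 128
# Assumed table contents: LUT[k - 1] = 2^(k/128) for k = 1..128.
LUT = [2.0 ** (k / LUT_SIZE) for k in range(1, LUT_SIZE + 1)]

def lut_exp(x):
    """Approximate e^x for x <= 0 with one table lookup and one binary shift."""
    if x > 0:
        raise ValueError("expects x <= 0 (softmax input after max subtraction)")
    t = -x * math.log2(math.e)        # e^x = 2^(-t) with t >= 0
    j = round(t * LUT_SIZE)           # quantize the exponent to steps of 1/128
    q = j // LUT_SIZE + 1             # integer part, realized as a right shift
    k = LUT_SIZE - (j % LUT_SIZE)     # table index in 1..128
    # 2^(-j/128) = 2^(k/128) * 2^(-q)
    return math.ldexp(LUT[k - 1], -q)
```

With the exponent quantized to steps of 1/128, the relative error of this sketch stays below roughly 0.3%. A fixed-point hardware implementation would differ in rounding and bit widths.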