BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination

Huizheng Wang, Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin

TL;DR

BitStopper tackles the primary inefficiency in dynamic sparsity (DS) attention by removing the external sparsity predictor and fusing bit-serial prediction with execution. It introduces bit-serial enabled stage fusion (BESF) for early, bit-level stage fusion, lightweight and adaptive token selection (LATS) for bit-grained token selection, and bit-level asynchronous processing (BAP) for asynchronous bit-plane processing, all implemented in a custom QK-PU/V-PU accelerator. The design reduces memory I/O and computation substantially, delivering speedups of roughly 2x (2.03x over Sanger and 1.89x over SOFA) and energy-efficiency improvements of 2.4x and 2.1x, respectively, while maintaining accuracy under INT12 quantization. Together, these contributions demonstrate a practical, scalable path to efficient DS attention that can benefit large-scale LLM inference and other transformer workloads.

Abstract

Attention-based large language models (LLMs) have transformed modern AI applications, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparsity (DS) attention mitigates this, yet its hardware efficiency is limited by the added prediction stage and the heavy memory traffic it entails. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. First, a bit-serial enabled stage fusion (BESF) mechanism is proposed to reuse and minimize memory accesses by progressively terminating trivial tokens and merging the prediction stage into the execution stage. Second, a lightweight and adaptive token selection (LATS) strategy is developed to work in concert with the bit-level sparsity speculation. Third, a bit-level asynchronous processing (BAP) strategy is employed to improve compute utilization during on-demand, bit-grained memory fetching. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Extensive evaluations demonstrate that, compared to state-of-the-art (SOTA) Transformer accelerators, BitStopper achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
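
To make the idea of progressive, bit-level termination concrete, here is a minimal sketch of an MSB-first bit-serial dot product that stops processing a key as soon as its best possible score falls below a cutoff. It is an illustration under stated assumptions, not the paper's BESF or LATS implementation: the function names, the two's-complement bit-plane encoding of the keys, and the fixed `threshold` (standing in for an adaptive selection rule) are all hypothetical. The 12-bit default mirrors the INT12 quantization mentioned in the TL;DR.

```python
import numpy as np

def bit_planes_msb_first(k, bits):
    """Split signed integers (two's complement, `bits` wide) into 0/1 planes, MSB first."""
    u = np.asarray(k, dtype=np.int64) & ((1 << bits) - 1)  # two's-complement encoding
    return [(u >> i) & 1 for i in range(bits - 1, -1, -1)]

def bit_serial_scores(q, K, threshold, bits=12):
    """Compute q.k for each key, one bit plane at a time, with early termination.

    Assumption: `threshold` is a fixed cutoff chosen by the caller; a token is
    dropped once an upper bound on its final score is below the cutoff.
    Returns (scores, planes_used); a pruned token has score None.
    """
    q = np.asarray(q, dtype=np.int64)
    K = np.asarray(K, dtype=np.int64)
    pos_sum = int(q[q > 0].sum())  # largest value q.b can take for any 0/1 plane b
    scores, planes_used = [], []
    for k in K:
        partial, used, pruned = 0, 0, False
        for idx, plane in enumerate(bit_planes_msb_first(k, bits)):
            i = bits - 1 - idx  # bit position of this plane
            # The sign plane of a two's-complement number carries negative weight.
            weight = -(1 << i) if i == bits - 1 else (1 << i)
            partial += weight * int(q @ plane)
            used += 1
            # Remaining planes i-1..0 have total weight 2^i - 1 and each adds at most pos_sum.
            upper = partial + ((1 << i) - 1) * pos_sum
            if upper < threshold:
                pruned = True  # this key can no longer reach the cutoff
                break
        scores.append(None if pruned else partial)
        planes_used.append(used)
    return scores, planes_used

if __name__ == "__main__":
    q = [3, -1, 2, 0]
    K = [[100, 5, 50, 7], [-200, 10, -100, 3], [1, 1, 1, 1]]
    print(bit_serial_scores(q, K, threshold=0))
```

Because the bound is conservative, a token that survives all planes ends with its exact integer score, while a pruned token is guaranteed to have a true score below the cutoff. In this toy setting the bit planes examined before termination are the only key data that must be fetched, which is the intuition behind merging prediction into execution.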

Paper Structure

This paper contains 19 sections, 4 equations, 14 figures, and 1 table.

Figures (14)

  • Figure 1: Workflow comparison of (a) traditional DS works and (b) this work, which identifies sparsity at runtime directly during the formal computation, without an extra sparsity predictor.
  • Figure 2: (a) Architecture of Transformer-based LLMs. (b) Workflow of current DS works. (c) Illustration of BitStopper, featuring stage fusion.
  • Figure 3: (a) Comparison of power distribution between dense attention and DS attention on TSMC 28 nm. (b) Accuracy of various token-selection strategies.
  • Figure 4: Fundamental limitations of current token selection strategies.
  • Figure 5: Illustration of the bit-serial enabled stage fusion (BESF) mechanism.
  • ...and 9 more figures