BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination

Huizheng Wang; Hongbin Wang; Shaojun Wei; Yang Hu; Shouyi Yin

TL;DR

BitStopper targets the main inefficiency of dynamic sparsity (DS) attention: the separate sparsity-prediction stage and the memory traffic it adds. It removes the external predictor and fuses bit-serial prediction into execution. Three techniques make this work: BESF (bit-serial enabled stage fusion), which terminates trivial tokens early at bit granularity; LATS, a lightweight and adaptive bit-grained token selection strategy; and BAP, bit-level asynchronous processing of bit planes. These are implemented in a custom accelerator built around QK-PU and V-PU units. The design reduces memory I/O and computation, achieving 2.03x and 1.89x speedups over the SOTA DS accelerators Sanger and SOFA, respectively, with 2.4x and 2.1x better energy efficiency, while maintaining accuracy under INT12 quantization. The result is a practical path to efficient DS attention for large-scale LLM inference and other Transformer workloads.

Abstract

Attention-based large language models (LLMs) have transformed modern AI applications, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparsity (DS) attention mitigates this cost, yet its hardware efficiency is limited by the added prediction stage and the heavy memory traffic that stage entails. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. First, a bit-serial enabled stage fusion (BESF) mechanism reuses and minimizes memory accesses by progressively terminating trivial tokens and merging the prediction stage into the execution stage. Second, a lightweight and adaptive token selection (LATS) strategy works in concert with the bit-level sparsity speculation. Third, a bit-level asynchronous processing (BAP) strategy improves compute utilization during on-demand, bit-grained memory fetching. Finally, a dedicated architecture translates the theoretical complexity reduction into practical performance improvement. Extensive evaluations demonstrate that, compared to state-of-the-art (SOTA) Transformer accelerators, BitStopper achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
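To make the idea of fusing prediction into bit-serial execution concrete, the following is a minimal software sketch, not the paper's algorithm or hardware. It processes two's-complement key bit planes from the most significant bit down, keeps an upper and lower bound on each token's final Q·K score, and stops fetching planes for any token whose upper bound falls below the best lower bound minus a margin. The function name `bit_serial_scores`, the `margin` parameter, and the bound-based pruning rule are all assumptions made for illustration; the abstract does not specify how LATS chooses its threshold.

```python
def bit_serial_scores(q, keys, bits=12, margin=0):
    """Illustrative sketch (hypothetical names, assumed pruning rule).

    q      : list of signed integers (one query vector)
    keys   : list of key vectors, entries in [-2**(bits-1), 2**(bits-1) - 1]
    margin : assumed slack; a token is dropped once its best possible
             score is more than `margin` below the best guaranteed score
    Returns ({token_index: exact score} for survivors, bit planes fetched).
    """
    pos = sum(x for x in q if x > 0)   # most an all-ones plane can add per unit weight
    neg = sum(x for x in q if x < 0)   # most an all-ones plane can subtract per unit weight
    mask = (1 << bits) - 1             # view signed keys as `bits`-wide two's complement
    partial = {t: 0 for t in range(len(keys))}  # partial scores of tokens still alive
    planes_fetched = 0

    for j in range(bits - 1, -1, -1):  # most significant plane first
        # The top plane is the sign plane and carries negative weight.
        weight = -(1 << j) if j == bits - 1 else (1 << j)
        for t in partial:
            plane_dot = sum(qi for qi, k in zip(q, keys[t])
                            if ((k & mask) >> j) & 1)
            partial[t] += weight * plane_dot
            planes_fetched += 1

        rest = (1 << j) - 1            # combined weight of the planes not yet seen
        best_lower = max(s + rest * neg for s in partial.values())
        # Early termination: keep a token only if its upper bound can still
        # come within `margin` of the best guaranteed score.
        partial = {t: s for t, s in partial.items()
                   if s + rest * pos >= best_lower - margin}

    return partial, planes_fetched


if __name__ == "__main__":
    q = [3, -1, 2, 5]
    keys = [[100, -20, 7, 300], [-500, 10, -3, -900],
            [90, 5, 60, 250], [1, 2, 3, 4]]
    survivors, fetched = bit_serial_scores(q, keys, bits=12, margin=300)
    print(survivors, fetched, "of", len(keys) * 12, "planes")
```

Because the bounds are exact for two's-complement integers, a pruned token can never end up within `margin` of the top score, and surviving tokens finish with their exact scores. The count of fetched planes stands in for the memory traffic that early termination saves; a real design would also need the scheduling that BAP addresses, which this sketch does not model.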
