ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity

Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu

TL;DR

ESACT introduces SPLS (Sparsity Prediction with Local Similarity), a sparsity mechanism for end-to-end acceleration of Transformers. Using HybridLog Quantization, it predicts local attention sparsity before QK generation and uses that prediction to sparsify QKV generation, attention computation, and the FFN. Three hardware innovations (bit-level prediction, progressive generation, and dynamic allocation) support an efficient hardware implementation. Across 26 benchmarks, SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT reaches an end-to-end energy efficiency of 3.29 TOPS/W and improves attention-level energy efficiency by 2.95x over SpAtten and 2.26x over Sanger.

Abstract

Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, while few consider inter-row sparsity. Approaches leveraging inter-row sparsity often rely on costly global similarity estimation, which diminishes the acceleration benefits of sparsity, and typically apply sparsity to only one or two transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead. Motivated by this observation, we propose ESACT, an end-to-end sparse accelerator for compute-intensive Transformers. ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism, which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation, achieving efficient sparsity across all transformer components. To support efficient hardware realization, we introduce three architectural innovations. Experimental results on 26 benchmarks demonstrate that SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W, and improves attention-level energy efficiency by 2.95x and 2.26x over SOTA attention accelerators SpAtten and Sanger, respectively.
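
To make the idea of inter-row local similarity concrete, the following is a minimal Python sketch. It is not the paper's SPLS mechanism: it computes the full attention rows first and uses no quantization, whereas SPLS predicts sparsity before QK generation. The function names (`softmax`, `cosine`, `local_similar_rows`), the cosine-similarity metric, the `window` size, and the `threshold` value are all assumptions chosen for illustration. The sketch only shows why restricting the comparison to nearby rows is cheaper than a global search: each row is compared with at most `window` earlier rows, so n rows need at most n * window comparisons instead of n * (n - 1) / 2.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def local_similar_rows(attn, window=2, threshold=0.95):
    """Hypothetical helper: for each attention row, look only at the
    `window` preceding rows and return the index of the nearest one that
    is similar enough to stand in for it, or None if the row has to be
    computed on its own. Only rows that are themselves computed (None)
    may serve as representatives."""
    reps = []
    for i, row in enumerate(attn):
        rep = None
        for j in range(i - 1, max(i - window, 0) - 1, -1):
            if reps[j] is None and cosine(row, attn[j]) >= threshold:
                rep = j
                break
        reps.append(rep)
    return reps


# Toy scores (made up): rows 0-1 attend to token 0, rows 2-3 to token 2.
scores = [
    [4.0, 0.0, 0.0, 0.0],
    [4.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 3.9, 0.1],
]
attn = [softmax(r) for r in scores]
reps = local_similar_rows(attn, window=2, threshold=0.95)
print(reps)
```

In this toy input, rows 1 and 3 are mapped to rows 0 and 2 respectively, so only two of the four rows would need a full computation under the assumed threshold.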

Paper Structure

This paper contains 21 sections, 1 equation, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Computation breakdown of BERT-Large and the challenge of using global similarity for attention acceleration.
  • Figure 2: Computational flow of a Transformer block.
  • Figure 3: Visualization of the attention distribution.
  • Figure 4: Percentage of heads exhibiting local similarity across different layers in BERT and GPT.
  • Figure 5: (a) Proposed SPLS mechanism. (b) Traditional attention prediction.
  • ...and 16 more figures