PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion

Huizheng Wang, Hongbin Wang, Zichuan Wang, Zhiheng Yue, Yang Wang, Chao Li, Yang Hu, Shouyi Yin

TL;DR

PADE tackles the high cost of self-attention by eliminating the separate sparsity predictor through a unified bit-serial stage-fusion design. It introduces three core techniques (BUI-GF, BS-OOE, and ISTA), implemented in a specialized accelerator with QK-PU and V-PU, to realize predictor-free dynamic sparse attention. Reported results show a 7.43x speedup and 31.1x higher energy efficiency than an Nvidia H100 GPU, and 5.1x, 4.3x, and 3.4x energy savings over the Sanger, DOTA, and SOFA accelerators, respectively, with good scalability to long sequences. The work offers a practical pathway for efficient sparse attention, including hardware-aware optimizations.

Abstract

Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevance token pairs. However, current approaches are impractical because the added sparsity predictor is expensive and severely degrades their hardware efficiency. This paper advances the state-of-the-art (SOTA) by proposing a bit-serial enabled stage-fusion (BSF) mechanism, which eliminates the need for a separate predictor. BSF, however, faces three key challenges: 1) inaccurate bit-sliced sparsity speculation leads to incorrect pruning; 2) fine-grained and imbalanced bit-level workloads cause hardware under-utilization; 3) the row-wise dependency in the sparsity pruning criteria makes tiling difficult. We propose PADE, a predictor-free algorithm-hardware co-design for dynamic sparse attention acceleration. PADE features three key innovations: 1) a bit-wise uncertainty interval-enabled guard filtering (BUI-GF) strategy to accurately identify trivial tokens during each bit round; 2) bidirectional sparsity-based out-of-order execution (BS-OOE) to improve hardware utilization; 3) interleaving-based sparsity-tiled attention (ISTA) to reduce both I/O and computational complexity. These techniques, combined with custom accelerator designs, enable practical sparsity acceleration without relying on an added sparsity predictor. Extensive experiments on 22 benchmarks show that PADE achieves a 7.43x speedup and 31.1x higher energy efficiency than an Nvidia H100 GPU. Compared to SOTA accelerators, PADE achieves 5.1x, 4.3x, and 3.4x energy savings over Sanger, DOTA, and SOFA, respectively.
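
The abstract does not spell out how BUI-GF computes its uncertainty intervals, so the following is only a minimal sketch of the general idea of pruning tokens from a bit-serial partial score, not the paper's algorithm. It assumes unsigned integer keys processed one bit plane at a time (most significant bit first), integer queries, and a hypothetical `margin` parameter: a token is dropped once the upper bound of its score falls more than `margin` below the best lower bound seen so far. The function name `bit_serial_prune` and the sample data are illustrative.

```python
def bit_serial_prune(q, keys, bits=4, margin=0):
    """Prune keys whose score cannot come within `margin` of the row maximum.

    q     : list of signed integers (one query vector)
    keys  : list of key vectors with unsigned integers in [0, 2**bits)
    Returns (indices of surviving keys, number of token/bit-plane updates).
    """
    pos = sum(x for x in q if x > 0)  # bounds the gain from unseen key bits
    neg = sum(x for x in q if x < 0)  # bounds the loss from unseen key bits
    alive = list(range(len(keys)))
    partial = [0] * len(keys)
    work = 0
    for b in range(bits - 1, -1, -1):  # MSB first
        for i in alive:
            # Accumulate the contribution of bit plane b of key i.
            partial[i] += sum(qj * (1 << b)
                              for qj, kj in zip(q, keys[i]) if (kj >> b) & 1)
            work += 1
        rest = (1 << b) - 1  # largest value the unseen low bits can still add
        # Exact score of token i lies in [partial + rest*neg, partial + rest*pos].
        best_lower = max(partial[i] + rest * neg for i in alive)
        alive = [i for i in alive
                 if partial[i] + rest * pos >= best_lower - margin]
    return alive, work


# Illustrative data (assumed, not from the paper).
q = [3, -2, 1, 4]
keys = [[15, 0, 9, 14], [1, 12, 0, 2], [7, 7, 7, 7], [0, 15, 1, 0]]
alive, work = bit_serial_prune(q, keys, bits=4, margin=10)
exact = [sum(a * b for a, b in zip(q, k)) for k in keys]
```

Because the interval always contains the exact score, this sketch never drops a token whose exact score is within `margin` of the row maximum, and tokens that are clearly irrelevant stop consuming bit-plane updates early. No separate predictor pass is needed, since the pruning decision reuses the partial sums of the computation itself.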

Paper Structure

This paper contains 26 sections, 7 equations, 26 figures, 3 tables.

Figures (26)

  • Figure 1: Comparison of (a) current DS works and (b) PADE.
  • Figure 2: (a) Power breakdown of dense and DS attention (SA: Sanger, SO: SOFA) on Llama7B in TSMC 28nm across executor bit-widths. (b) Power ratio of predictor and executor versus SL under an 8-bit quantized executor.
  • Figure 3: Illustration of the DS attention mechanism.
  • Figure 4: (a) Traditional DS works, featuring stage splitting. (b) Our work features stage-fusion. (c) Reduced complexity for stage splitting and stage fusion.
  • Figure 5: Challenges for bit-serial enabled stage fusion. (a)-(b) Inaccuracy. (c)-(d) Hardware under-utilization. (e)-(f) Tiling difficulty.
  • ...and 21 more figures