SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling

Huizheng Wang, Jiahao Fang, Xinru Tang, Zhiheng Yue, Jinxi Li, Yubin Qin, Sihan Guan, Qize Yang, Yang Wang, Chao Li, Yang Hu, Shouyi Yin

TL;DR

SOFA addresses the memory and latency bottlenecks of dynamic sparsity in Transformer inference under large-scale token parallel processing (LTPP) by introducing cross-stage coordinated tiling and a compute-memory co-design. Its core techniques are DLZS for low-cost sparsity prediction, sphere-search aided distributed sorting (SADS), and sorted-updating FlashAttention (SU-FA); together they enable fine-grained cross-stage pipelining that reduces memory traffic and latency. Bayesian design-space exploration tunes the tiling and sparsity parameters, and a dedicated SOFA accelerator implements these techniques with a reusable DLZS engine, a flexible SADS sorting engine, and a max-assurance SU-FA engine. On 20 benchmarks, SOFA reports a 9.5× speedup and 71.5× higher energy efficiency relative to an Nvidia A100 GPU.

Abstract

Benefiting from the self-attention mechanism, Transformer models have attained impressive contextual comprehension capabilities for lengthy texts. As large language models (LLMs) become increasingly prevalent, the demand for high-throughput inference grows, which calls for large-scale token parallel processing (LTPP). However, existing dynamic sparse accelerators struggle to handle LTPP effectively, because they optimize each stage in isolation and confine most of their effort to computational enhancements. By re-examining the end-to-end flow of dynamic sparse acceleration, we pinpoint a long-overlooked opportunity: LTPP can exploit the intrinsic coordination among stages to avoid excessive memory access and redundant computation. Motivated by this observation, we present SOFA, a cross-stage compute-memory efficient algorithm-hardware co-design tailored to the challenges that LTPP poses for Transformer inference. We first propose a novel leading zero computing paradigm, which predicts attention sparsity using log-based add-only operations to avoid the significant overhead of prediction. Then, a distributed sorting mechanism and a sorted updating FlashAttention mechanism are proposed under a cross-stage coordinated tiling principle, which enables fine-grained and lightweight coordination among stages and helps optimize memory access and latency. Further, we propose a SOFA accelerator to support these optimizations efficiently. Extensive experiments on 20 benchmarks show that SOFA achieves a $9.5\times$ speedup and $71.5\times$ higher energy efficiency than an Nvidia A100 GPU. Compared to 8 SOTA accelerators, SOFA achieves on average $15.8\times$ higher energy efficiency, $10.3\times$ higher area efficiency and a $9.3\times$ speedup, respectively.
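To make the "log-based add-only" prediction idea more concrete, the following is a minimal sketch, not the paper's DLZS algorithm. It assumes integer (quantized) queries and keys, replaces each product with a power of two whose exponent is the sum of the operands' highest-set-bit positions (what a leading-zero counter would provide in hardware), and uses the resulting rough scores only to pick candidate keys. The names `floor_log2`, `approx_score`, and `predict_top_k` are hypothetical.

```python
def floor_log2(x: int) -> int:
    """Index of the highest set bit of |x| (x must be non-zero).

    In hardware this would be derived from a leading-zero count.
    """
    return abs(x).bit_length() - 1


def approx_score(q, k):
    """Rough estimate of sum(q_i * k_i) without any multiplications.

    Each product is replaced by sign * 2**(floor_log2(q_i) + floor_log2(k_i)),
    so the per-element work is an exponent addition and a shift.
    """
    total = 0
    for qi, ki in zip(q, k):
        if qi == 0 or ki == 0:
            continue  # a zero operand contributes nothing
        exponent = floor_log2(qi) + floor_log2(ki)
        sign = -1 if (qi < 0) != (ki < 0) else 1
        total += sign * (1 << exponent)
    return total


def predict_top_k(q, keys, k):
    """Indices of the k keys with the largest approximate scores."""
    scores = [approx_score(q, key) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical quantized query and keys, for illustration only.
query = [4, -2, 1]
keys = [[8, 1, 0], [1, 1, 1], [-8, 4, 2]]
selected = predict_top_k(query, keys, 2)
rough = approx_score([5, -3], [6, 7])  # exact dot product is 9
```

The estimate is exact when every operand is a power of two and otherwise underestimates each product's magnitude, which is tolerable when the scores are only used to rank keys before the exact attention computation.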

Paper Structure

This paper contains 27 sections, 2 equations, 21 figures, 4 tables, 1 algorithm.

Figures (21)

  • Figure 1: Transformer memory and computation breakdown for long sequence.
  • Figure 2: Dynamic sparsity challenges for LTPP and SOFA's software and hardware co-design.
  • Figure 3: MAT for SOTA dynamic sparsity accelerators (FACT [qin2023fact], Energon [zhou2022energon]) with diverse parallelisms.
  • Figure 4: Basic components of a Transformer model and operation intensity.
  • Figure 5: Process of FlashAttention-2 and its computation overhead.
  • ...and 16 more figures
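
The FlashAttention-2 process named in Figure 5 builds on a tiled, single-pass ("online") softmax, and the abstract proposes a sorted updating variant. The sketch below is an illustration under assumptions rather than the paper's SU-FA: it uses scalar values instead of value vectors, and it assumes that the benefit of sorting is that the running maximum is fixed by the first tile, so earlier partial sums never need rescaling. The function name `online_softmax_weighted_sum` and the sample data are hypothetical.

```python
import math


def online_softmax_weighted_sum(scores, values, tile=2):
    """Single-pass softmax(scores) . values, processed tile by tile.

    Returns (output, rescales), where rescales counts how often the running
    maximum grew and the earlier partial sums had to be rescaled.
    """
    running_max = -math.inf
    denom = 0.0   # running sum of exp(score - running_max)
    numer = 0.0   # running sum of exp(score - running_max) * value
    rescales = 0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        if new_max > running_max and denom > 0.0:
            # The maximum grew: bring earlier partial sums to the new scale.
            scale = math.exp(running_max - new_max)
            denom *= scale
            numer *= scale
            rescales += 1
        running_max = new_max
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - running_max)
            denom += p
            numer += p * v
    return numer / denom, rescales


# Hypothetical attention scores and scalar values, for illustration only.
scores = [1.0, 3.0, 0.5, 2.0, 4.0, -1.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

out_unsorted, rescales_unsorted = online_softmax_weighted_sum(scores, values)

# Visit the same (score, value) pairs in descending score order.
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
out_sorted, rescales_sorted = online_softmax_weighted_sum(
    [scores[i] for i in order], [values[i] for i in order]
)
```

Both orders produce the same result up to floating-point rounding, but in this example the descending order needs no rescaling step while the original order needs one.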