LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model
Huizheng Wang, Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin
TL;DR
This work addresses the input-dependent, stage-shifting bottlenecks of Transformer computation by introducing cross-stage sparsity (CSS), which accelerates both QKV projection and self-attention. It presents LAPA, a software-hardware co-design that combines asymmetric leading-one computing (ALOC), log-domain multi-round shifting accumulation (MRSA), and data-feature dependent filtering (DDF) with a dedicated accelerator containing a speculation and execution unit. Under CSS, a predicted sparse attention mask guides on-demand QKV generation, so that projections are computed only where the mask requires them. The paper reports energy-efficiency gains over SOTA accelerators and throughput improvements across large language models, with hardware results obtained at 28 nm. Overall, LAPA shows that cross-stage sparsity, paired with multiplication-free prediction and multi-round pruning, can reduce the computational and energy cost of Transformer inference.
Abstract
Attention-based Transformers have revolutionized natural language processing (NLP) and shown strong performance in computer vision (CV) tasks. However, as the input sequence length varies, the computational bottleneck in Transformer models shifts dynamically across stages, which calls for a cross-stage sparse acceleration strategy. Unfortunately, most existing sparse Transformer approaches operate on a single stage, and their sparsity prediction mechanisms incur significant power overhead when applied across multiple stages. To this end, this paper proposes a log-domain attention prediction algorithm-architecture co-design, named LAPA. First, an asymmetric leading-one computing (ALOC) scheme is designed to eliminate expensive multiplications. Next, a mixed-precision multi-round shifting accumulation (MRSA) mechanism is proposed to mitigate the accumulation overhead. A data-feature dependent filter (DDF) strategy is then designed to work in concert with the MRSA process. Finally, a dedicated accelerator is designed to translate the theoretical enhancement into practical hardware improvement. Experimental results show that LAPA achieves 3.52x, 3.24x and 2.79x higher energy efficiency than the state-of-the-art (SOTA) works SpAtten, Sanger and FACT, respectively.
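To make the idea of multiplication-free, log-domain prediction concrete, the following is a minimal Python sketch. It is not the ALOC, MRSA, or DDF design from the paper, whose details are not given in this abstract. It only illustrates the general leading-one principle under stated assumptions: each integer operand is approximated by the power of two at its leading-one position, so a product reduces to adding two exponents and performing one shift. The function names (`leading_one`, `approx_dot`, `predict_mask`) and the top-k selection rule are hypothetical and chosen for illustration only.

```python
def leading_one(x: int) -> int:
    """Bit position of the most significant set bit of |x| (x must be nonzero)."""
    return abs(x).bit_length() - 1


def approx_dot(q, k):
    """Estimate a dot product without multiplications.

    Assumption: |a * b| is approximated by 2 ** (lo(a) + lo(b)), where lo() is
    the leading-one position. The sign is recovered by comparing operand signs.
    """
    acc = 0
    for a, b in zip(q, k):
        if a == 0 or b == 0:
            continue  # a zero operand contributes nothing
        term = 1 << (leading_one(a) + leading_one(b))  # exponent add + shift
        acc += -term if (a < 0) != (b < 0) else term
    return acc


def predict_mask(q, keys, keep):
    """Keep the `keep` keys with the largest estimated scores (illustrative rule)."""
    scores = [approx_dot(q, k) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [i in kept for i in range(len(keys))]


query = [5, -3]
keys = [[6, 2], [-4, 1], [1, 1]]
mask = predict_mask(query, keys, keep=1)
```

In this toy example the estimate for the first key is 12 against an exact score of 24, yet the ranking of the three keys matches the exact ranking, which is all a sparsity predictor needs. In a cross-stage setting such a mask would be used to decide which attention entries, and which QKV projections feeding them, are worth computing at full precision.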
