Accelerating Sparse Transformer Inference on GPU

Wenhao Dai; Haodong Deng; Mengfei Rong; Xinyu Yang; Hongyu Liu; Fangxin Liu; Hailong Yang; Qianwen Cao; Qingxiao Sun

TL;DR

STOF addresses the inefficiency of sparse Transformer inference on GPUs by co-designing a unified MHA kernel capable of representing arbitrary masking patterns and a flexible operator fusion framework that maps fusion schemes to compilation templates. It introduces a two-level sparse storage format (BSR+bitmap) for MHA, enabling row-wise and block-wise kernels that adapt to mask sparsity and sequence length, while a hierarchical search engine expands fusion boundaries and tunes kernel parameters through a two-stage procedure. The system demonstrates up to 1.6x MHA speedups and 1.4x end-to-end gains over state-of-the-art methods across multiple models and GPU platforms, with low tuning overhead and good scalability to longer sequences and newer architectures. By providing a modular, template-based fusion layer and hardware-aware kernel selection, STOF offers a practical path to high-performance sparse Transformer inference in diverse deployment scenarios and lays groundwork for extending to future architectures and dynamic masking regimes.
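To make the two-level idea concrete, the following is a minimal sketch of a BSR-plus-bitmap encoding of an attention mask. It is not STOF's actual layout: the function names `encode_mask` and `decode_mask`, the block size, and the bit ordering within a block are all assumptions chosen for illustration. Level one records which blocks contain at least one kept entry (BSR-style `row_ptr` and `col_idx`), and level two stores one bitmap per kept block.

```python
from typing import List, Tuple


def encode_mask(mask: List[List[int]], block: int) -> Tuple[List[int], List[int], List[int]]:
    """Encode a 0/1 mask as (row_ptr, col_idx, bitmaps).

    Level 1 (BSR-style): row_ptr/col_idx index the blocks with any kept entry.
    Level 2: one integer bitmap per stored block; bit (r * block + c) is set
    when the entry at local position (r, c) is kept. (Bit order is an assumption.)
    """
    n_rows, n_cols = len(mask), len(mask[0])
    assert n_rows % block == 0 and n_cols % block == 0
    row_ptr, col_idx, bitmaps = [0], [], []
    for br in range(n_rows // block):
        for bc in range(n_cols // block):
            bits = 0
            for r in range(block):
                for c in range(block):
                    if mask[br * block + r][bc * block + c]:
                        bits |= 1 << (r * block + c)
            if bits:  # fully masked blocks are not stored at all
                col_idx.append(bc)
                bitmaps.append(bits)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, bitmaps


def decode_mask(row_ptr: List[int], col_idx: List[int], bitmaps: List[int],
                block: int, n_cols: int) -> List[List[int]]:
    """Rebuild the dense 0/1 mask from the two-level encoding."""
    n_rows = (len(row_ptr) - 1) * block
    mask = [[0] * n_cols for _ in range(n_rows)]
    for br in range(len(row_ptr) - 1):
        for k in range(row_ptr[br], row_ptr[br + 1]):
            bc, bits = col_idx[k], bitmaps[k]
            for r in range(block):
                for c in range(block):
                    if (bits >> (r * block + c)) & 1:
                        mask[br * block + r][bc * block + c] = 1
    return mask


# A 4x4 causal mask split into 2x2 blocks: the upper-right block is empty.
causal = [[1 if j <= i else 0 for j in range(4)] for i in range(4)]
row_ptr, col_idx, bitmaps = encode_mask(causal, block=2)
restored = decode_mask(row_ptr, col_idx, bitmaps, block=2, n_cols=4)
```

In this toy case only three of the four blocks are stored, and the bitmap distinguishes the fully kept block from the two partially kept diagonal blocks, which is the kind of information a kernel could use to skip or specialize work.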

Abstract

Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. Because the Transformer is the core component of LLMs, accelerating it through parallelization has become an active research topic. Mask layers introduce sparsity into the Transformer and thereby reduce computation. However, previous work has rarely focused on optimizing the performance of sparse Transformers. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address these problems, we propose STOF, a framework that optimizes Sparse Transformer inference on GPU through flexible masking and Operator Fusion. For the multi-head attention (MHA) structure, STOF maps the computation to row-wise or block-wise kernels with dedicated storage formats, selected according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through a two-stage search. Experimental results show that, compared to state-of-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.
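The claim that masking reduces computation can be illustrated with a small pure-Python sketch. This is not STOF's row-wise kernel; the helper names `sparse_attention_row` and `dense_attention_row` are hypothetical, and the code only shows that visiting the kept key positions for one query row gives the same result as scoring every key and masking afterwards, while computing fewer dot products.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sparse_attention_row(q, keys, values, keep):
    """Attention output for one query row, visiting only kept key positions."""
    scale = 1.0 / math.sqrt(len(q))
    idx = [j for j, k in enumerate(keep) if k]
    scores = [scale * dot(q, keys[j]) for j in idx]  # one dot product per kept key
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, j in zip(weights, idx):
        for d, v in enumerate(values[j]):
            out[d] += (w / total) * v
    return out, len(idx)


def dense_attention_row(q, keys, values, keep):
    """Reference: score every key, then set masked scores to -inf."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]  # one dot product per key
    scores = [s if kept else float("-inf") for s, kept in zip(scores, keep)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v_row in zip(weights, values):
        for d, v in enumerate(v_row):
            out[d] += (w / total) * v
    return out, len(keys)


# Toy inputs (made up for illustration): 4 keys, only positions 0 and 2 are kept.
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
keep = [1, 0, 1, 0]

sparse_out, sparse_dots = sparse_attention_row(q, keys, values, keep)
dense_out, dense_dots = dense_attention_row(q, keys, values, keep)
```

Both functions assume at least one kept position per row; a fully masked row would need separate handling.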
