FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
Liang Qiao, Yue Dai, Yeqi Huang, Hongyu Kan, Jun Shi, Hong An
TL;DR
This work tackles the computational bottleneck of diffusion transformers with FlashOmni, a unified sparse attention engine that accommodates arbitrary sparsity patterns. It combines an Update-Dispatch paradigm, compact 8-bit sparse symbols $S_c$ and $S_s$, a general sparse attention kernel, and sparse GEMMs (GEMM-$Q$ and GEMM-$O$) to support multi-granularity sparsity. Empirical results show near-linear speedups and substantial end-to-end gains across multiple diffusion models while preserving visual quality, including 2.5–3.8× speedups in GEMM-$O$ and about 1.5× end-to-end acceleration on HunyuanVideo. By unifying sparsity strategies into a single kernel and optimized linear layers, the framework enables scalable, real-time inference for high-resolution image and video diffusion tasks on commodity GPUs without retraining.
Abstract
Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional capabilities in visual synthesis, yet their deployment remains constrained by substantial computational demands. To alleviate this bottleneck, many sparsity-based acceleration methods have been proposed. However, their diverse sparsity patterns often require customized kernels for high-performance inference, limiting universality. We propose FlashOmni, a unified sparse attention engine compatible with arbitrary DiT architectures. FlashOmni introduces flexible sparse symbols to standardize the representation of a wide range of sparsity strategies, such as feature caching and block-sparse skipping. This unified abstraction enables the execution of diverse sparse computations within a single attention kernel. In addition, FlashOmni provides optimized sparse GEMMs for attention blocks, leveraging the sparse symbols to eliminate redundant computations and further improve efficiency. Experiments demonstrate that FlashOmni delivers near-linear speedup that closely matches the sparsity ratio (1:1) in attention and GEMM-$Q$, and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of the theoretical limit). Applied with a multi-granularity sparsity strategy, it enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end acceleration without degrading visual quality.
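To make the idea of compact sparse symbols concrete, the following is a minimal illustrative sketch, not the FlashOmni implementation. It assumes a hypothetical layout in which one bit marks whether a block is skipped and eight such bits are packed into one 8-bit symbol; the actual encoding of $S_c$ and $S_s$ is not specified in this abstract. The helper names `pack_block_flags`, `is_skipped`, and `ideal_speedup` are invented for illustration. The speedup function encodes one reading of "near-linear (1:1)": runtime scales with the fraction of blocks still computed.

```python
import numpy as np

def pack_block_flags(skip):
    """Pack a boolean per-block skip mask into uint8 symbols (8 blocks per byte).

    Assumed layout (hypothetical): bit i of byte j describes block 8*j + i.
    """
    skip = np.asarray(skip, dtype=bool)
    return np.packbits(skip, bitorder="little")

def is_skipped(symbols, block_idx):
    """Decode a single block's flag from the packed 8-bit symbols."""
    byte = int(symbols[block_idx // 8])
    return bool((byte >> (block_idx % 8)) & 1)

def ideal_speedup(sparsity):
    """Ideal speedup if runtime is proportional to the work that remains."""
    return 1.0 / (1.0 - sparsity)

# Ten blocks, six of them marked as skippable.
skip = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1], dtype=bool)
symbols = pack_block_flags(skip)   # two bytes cover all ten blocks
sparsity = float(skip.mean())      # fraction of skipped blocks

print(symbols.tolist(), sparsity, round(ideal_speedup(sparsity), 2))
```

Under these assumptions, a kernel would read one byte to decide the fate of eight blocks, which keeps the metadata small relative to the attention inputs.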
