FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers

Liang Qiao, Yue Dai, Yeqi Huang, Hongyu Kan, Jun Shi, Hong An

TL;DR

This work tackles the computational bottleneck of diffusion transformers with FlashOmni, a unified sparse attention engine that accommodates arbitrary sparsity patterns. It introduces an Update-Dispatch paradigm, compact 8-bit sparse symbols $S_c$ and $S_s$, and a general sparse attention kernel plus sparse GEMMs (GEMM-$Q$ and GEMM-$O$) to support multi-granularity sparsity. Empirical results show near-linear speedups and substantial end-to-end gains across multiple diffusion models while preserving visual quality, with 2.5–3.8× speedups in GEMM-$O$ and about 1.5× end-to-end acceleration on HunyuanVideo. By unifying sparsity strategies into a single kernel and optimized linear layers, the framework enables scalable, real-time inference for high-resolution image and video diffusion tasks on commodity GPUs without retraining.

Abstract

Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional capabilities in visual synthesis, yet their deployment remains constrained by substantial computational demands. To alleviate this bottleneck, many sparsity-based acceleration methods have been proposed. However, their diverse sparsity patterns often require customized kernels for high-performance inference, limiting universality. We propose FlashOmni, a unified sparse attention engine compatible with arbitrary DiT architectures. FlashOmni introduces flexible sparse symbols to standardize the representation of a wide range of sparsity strategies, such as feature caching and block-sparse skipping. This unified abstraction enables the execution of diverse sparse computations within a single attention kernel. In addition, FlashOmni designs optimized sparse GEMMs for attention blocks, leveraging sparse symbols to eliminate redundant computations and further improve efficiency. Experiments demonstrate that FlashOmni delivers near-linear speedup that closely matches the sparsity ratio (1:1) in attention and GEMM-$Q$, and achieves 2.5$\times$–3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of the theoretical limit). Applied with a multi-granularity sparsity strategy, it enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end acceleration without degrading visual quality.
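To make the idea of compact sparse symbols concrete, the following is a minimal sketch, not the paper's actual encoding. It assumes that each bit of an 8-bit word marks whether one key/value block should be computed, and that a kernel reads those bits to skip blocks. The helper names (`pack_flags`, `flag_at`, `blocks_to_compute`) and the example mask are hypothetical.

```python
def pack_flags(flags):
    """Pack a list of booleans into bytes, 8 flags per byte (least significant bit first)."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def flag_at(packed, i):
    """Read flag i back from the packed bytes."""
    return bool((packed[i // 8] >> (i % 8)) & 1)


def blocks_to_compute(packed, num_blocks):
    """Indices of blocks whose flag is set, i.e. blocks that are not skipped."""
    return [i for i in range(num_blocks) if flag_at(packed, i)]


# Hypothetical mask over 10 key/value blocks for one query block (True = compute).
keep = [True, False, False, True, True, False, False, False, True, False]
packed = pack_flags(keep)                      # 10 flags fit in 2 bytes
active = blocks_to_compute(packed, len(keep))  # blocks a sparse kernel would visit
sparsity = 1 - len(active) / len(keep)         # fraction of blocks skipped
```

In this toy layout the mask costs one bit per block, and the share of skipped blocks is the sparsity ratio that an ideal kernel would translate into saved work.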

Paper Structure

This paper contains 24 sections, 5 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: Visualization of different acceleration methods on FLUX and FlashOmni's speedup on HunyuanVideo.
  • Figure 2: Typical sparse methods for DiTs.
  • Figure 3: FlashOmni design.
  • Figure 4: Detailed workflow of the FlashOmni framework: incorporating unified sparse symbols and sparse kernels (general sparse attention and GEMMs). Unified sparse symbols are refreshed only at Update timesteps, providing sparse guidance for corresponding sparse kernel executions at Dispatch timesteps.
  • Figure 5: Example of FlashOmni sparse symbols generation for a single head of attention.
  • ...and 9 more figures
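
The Figure 4 caption describes symbols that are refreshed only at Update timesteps and then guide sparse execution at Dispatch timesteps. The sketch below illustrates that control flow under stated assumptions: a fixed `update_interval` decides which steps are Update steps, and `refresh` and `dispatch` are hypothetical callbacks standing in for symbol generation and sparse kernel execution. The paper's actual schedule and kernels may differ.

```python
def run_schedule(num_steps, update_interval, refresh, dispatch):
    """Refresh symbols on Update steps and reuse the cached symbols on Dispatch steps."""
    symbols = None
    trace = []
    for t in range(num_steps):
        if t % update_interval == 0:
            # Update step (assumed to fall on a fixed interval): regenerate the symbols.
            symbols = refresh(t)
            trace.append((t, "update", symbols))
        else:
            # Dispatch step: run the sparse computation guided by the cached symbols.
            dispatch(t, symbols)
            trace.append((t, "dispatch", symbols))
    return trace


def toy_refresh(t):
    """Stand-in for symbol generation; returns a label instead of real symbols."""
    return f"symbols@{t}"


used = []


def toy_dispatch(t, symbols):
    """Stand-in for a sparse kernel; records which symbols it was given."""
    used.append((t, symbols))


trace = run_schedule(6, 3, toy_refresh, toy_dispatch)
```

With six steps and an interval of three, steps 0 and 3 refresh the symbols and the remaining steps reuse the most recent ones, so symbol generation is amortized over several denoising steps.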

Theorems & Definitions (2)

  • Definition 1: Logical Block Sparse Masks
  • Definition 2: Sparse Strategies