HilbertA: Hilbert Attention for Image Generation with Diffusion Models
Shaoyi Zheng, Wenbo Lu, Yuxuan Xia, Haomin Liu, Shengjie Wang
TL;DR
Diffusion transformers face quadratic self-attention costs at high resolutions, which calls for sparse attention patterns that preserve 2D locality without sacrificing hardware efficiency. HilbertA addresses this by reordering image tokens along a Hilbert curve, partitioning the reordered sequence into tiles for local attention, and applying a layer-wise sliding mechanism with a fixed central shared region, all implemented in Triton. It delivers attention speedups of $2.3\times$ at $1024\times1024$ and up to $4.17\times$ at $2048\times2048$ while maintaining image quality comparable to or better than baselines, demonstrating hardware-aligned sparse attention for high-resolution image generation. The approach reduces memory I/O bottlenecks and incurs minimal reordering overhead, and its design generalizes to 3D for video diffusion. This may broaden the practical impact of diffusion models, though boundary artifacts and ethical use remain considerations.
Abstract
Designing sparse attention for diffusion transformers requires reconciling two-dimensional spatial locality with GPU efficiency, a balance that current methods struggle to strike. Existing approaches enforce two-dimensional spatial locality but often incur uncoalesced memory access. We present HilbertA, a 2D-aware and GPU-efficient sparse attention mechanism. HilbertA reorders image tokens along Hilbert curves to achieve a contiguous memory layout while preserving spatial neighborhoods, and employs a sliding schedule across layers to enable long-range information propagation without repeated or uncoalesced memory access. To further enhance cross-tile communication and positional awareness, HilbertA introduces a small central shared region. Implemented in Triton and evaluated on Flux.1-dev, HilbertA achieves significant acceleration over prior methods, demonstrating the feasibility of hardware-aligned two-dimensional sparse attention for high-resolution image generation. It delivers attention speedups of $2.3\times$ when generating $1024\times 1024$ images, and up to $4.17\times$ at $2048\times 2048$, while achieving image quality comparable to or surpassing baselines.
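To make the reordering step concrete, the following is a minimal NumPy sketch of Hilbert-curve token reordering followed by contiguous tiling. It is an illustration under stated assumptions, not the Triton implementation: the function names (`hilbert_d2xy`, `hilbert_permutation`, `tile_ids`) are hypothetical, the grid side is assumed to be a power of two, and the `shift` parameter is only a simplified stand-in for the layer-wise sliding schedule. The central shared region is not modeled.

```python
import numpy as np

def hilbert_d2xy(n, d):
    """Map curve index d to (x, y) on an n x n grid (n a power of two).

    Standard iterative Hilbert-curve index-to-coordinate conversion.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so sub-curves connect end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_permutation(n):
    """perm[k] is the row-major index of the k-th token along the curve."""
    perm = np.empty(n * n, dtype=np.int64)
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)
        perm[d] = y * n + x
    return perm

def tile_ids(num_tokens, tile_size, shift=0):
    """Assign each position of the reordered sequence to a contiguous tile.

    `shift` is a hypothetical per-layer offset (with wrap-around) that moves
    tile boundaries; it is an assumption made for illustration only.
    """
    return ((np.arange(num_tokens) + shift) % num_tokens) // tile_size

# Stand-in for an 8 x 8 grid of image tokens stored in row-major order.
n = 8
perm = hilbert_permutation(n)
tokens = np.arange(n * n)
reordered = tokens[perm]           # contiguous in memory, local in 2D
tiles = tile_ids(n * n, 16)        # 4 tiles of 16 tokens each
local_mask = tiles[:, None] == tiles[None, :]  # tile-local attention pattern
```

In this sketch each tile is a contiguous slice of the reordered sequence, so a kernel can load it with contiguous reads, while on the image grid the same 16 tokens form a $4\times4$ block. Passing a different `shift` per layer moves the tile boundaries, which is one simple way to let information cross tiles over depth.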
