HilbertA: Hilbert Attention for Image Generation with Diffusion Models

Shaoyi Zheng, Wenbo Lu, Yuxuan Xia, Haomin Liu, Shengjie Wang

TL;DR

Diffusion transformers face quadratic self-attention costs at high resolutions, demanding attention patterns that preserve 2D locality without sacrificing hardware efficiency. HilbertA addresses this by reordering image tokens along a Hilbert curve, partitioning the reordered sequence into tiles for local attention, and applying a layer-wise sliding mechanism with a fixed central shared region, all implemented in Triton. It delivers attention speedups of $2.3\times$ at $1024\times1024$ and up to $4.17\times$ at $2048\times2048$, while maintaining image quality comparable to or better than baselines, demonstrating hardware-aligned sparse attention for high-resolution image generation. The approach reduces memory I/O bottlenecks and incurs minimal reordering overhead. Its design generalizes to 3D for video diffusion, potentially broadening the practical impact of diffusion models, while raising considerations around boundary artifacts and ethical use.

Abstract

Designing sparse attention for diffusion transformers requires reconciling two-dimensional spatial locality with GPU efficiency, a balance that current methods struggle to strike. Existing approaches enforce two-dimensional spatial locality but often incur uncoalesced memory access. We present HilbertA, a 2D-aware and GPU-efficient sparse attention mechanism. HilbertA reorders image tokens along Hilbert curves to achieve a contiguous memory layout while preserving spatial neighborhoods, and employs a sliding schedule across layers to enable long-range information propagation without repeated or uncoalesced memory access. To further enhance cross-tile communication and positional awareness, HilbertA introduces a small central shared region. Implemented in Triton, HilbertA delivers comparable image quality with significant acceleration over prior methods on Flux.1-dev, demonstrating the feasibility of hardware-aligned two-dimensional sparse attention for high-resolution image generation. HilbertA delivers attention speedups of $2.3\times$ when generating $1024\times 1024$ images, and up to $4.17\times$ at $2048\times 2048$, while achieving image quality comparable to or surpassing baselines.
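
To make the reordering step concrete, the following is a minimal pure-Python sketch of how row-major image tokens could be permuted into Hilbert-curve order. It is not the authors' Triton implementation: the function names `hilbert_d2xy` and `hilbert_permutation` are hypothetical, the grid is assumed to be square with a power-of-two side, and the index-to-coordinate mapping is the standard textbook Hilbert construction.

```python
def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so sub-curves connect end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_permutation(order):
    """Return perm such that perm[d] is the row-major index of the d-th Hilbert cell."""
    n = 1 << order
    perm = []
    for d in range(n * n):
        x, y = hilbert_d2xy(order, d)
        perm.append(y * n + x)
    return perm


# Illustrative usage on a 4x4 token grid (order = 2).
order = 2
perm = hilbert_permutation(order)
tokens = [f"tok{i}" for i in range(16)]      # row-major tokens
reordered = [tokens[i] for i in perm]        # contiguous in Hilbert order

# Inverse permutation restores the row-major layout after attention.
inverse = [0] * len(perm)
for d, i in enumerate(perm):
    inverse[i] = d
restored = [reordered[inverse[i]] for i in range(len(tokens))]
```

Because consecutive positions on the curve are always grid neighbors, any contiguous slice of `reordered` corresponds to a spatially compact patch of the image, which is the property the abstract relies on for tiles that are both 2D-local and contiguous in memory.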

Paper Structure

This paper contains 35 sections, 8 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Image token layouts with their corresponding sparse attention patterns. Left: Image token layouts. Sparse patterns (Middle: Contiguous but not 2D-local; Right: 2D-local but not contiguous).
  • Figure 2: The HilbertA pipeline has three stages: (1) Reordering tokens along a Hilbert curve for contiguous memory layout; (2) Tiling tokens into local blocks for efficient intra-tile attention; and (3) Sliding the window by a fixed offset, enabling cross-tile interaction while preserving efficient memory access.
  • Figure 3: Comparison of space-filling curves. Left & Middle: Edge Average Stretch (EAS) & Geometric Distortion Error (GDE) across grid sizes. Hilbert curves achieve the lowest GDE and the second-lowest EAS, outperforming traversals and Morton order. Right: example orderings at the grid size of $2^3$.
  • Figure 4: Illustration of sliding tile access along the Hilbert-ordered sequence. Left: image tokens are reordered by the Hilbert curve and partitioned into tiles. Right: modular indexing enables contiguous memory access as the attention window advances, ensuring efficient cross-tile communication without uncoalesced reads.
  • Figure 5: Qualitative results at $1024\times1024$ resolution comparing Flux.1-dev, CLEAR, Sparge Attention, and HilbertA. HilbertA achieves image quality comparable to state-of-the-art baselines. More qualitative results are shown in Appendix B.
  • ...and 7 more figures
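
The tiling and sliding stages described in the captions of Figures 2 and 4 can be illustrated with a small sketch. This is an assumption-laden toy rather than the paper's schedule: the function name `tile_indices`, the `shift` parameter, and the rule "offset = layer × shift" are hypothetical, and the central shared region is omitted. It only shows how modular indexing lets each tile remain a contiguous (wrap-around) run of the Hilbert-ordered sequence while tile boundaries move from layer to layer.

```python
def tile_indices(seq_len, tile_size, layer, shift):
    """Return the token indices of each tile at a given layer.

    Tiles are contiguous runs of the (Hilbert-ordered) sequence; the whole
    partition is shifted by layer * shift positions, wrapping modulo seq_len.
    """
    assert seq_len % tile_size == 0, "sketch assumes tiles divide the sequence"
    offset = (layer * shift) % seq_len
    return [
        [(start + offset + j) % seq_len for j in range(tile_size)]
        for start in range(0, seq_len, tile_size)
    ]


# Illustrative usage: 8 tokens, tiles of 4, boundaries shifted by 2 per layer.
layer0 = tile_indices(8, 4, layer=0, shift=2)
layer1 = tile_indices(8, 4, layer=1, shift=2)
print(layer0)
print(layer1)
```

In this toy setting, tokens 3 and 4 fall in different tiles at layer 0 but share a tile at layer 1, so information can cross the original tile boundary after two layers without any tile reading a non-contiguous range.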