Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

Katherine Crowson; Stefan Andreas Baumann; Alex Birch; Tanishq Mathew Abraham; Daniel Z. Kaplan; Enrico Shippole

TL;DR

This work introduces the Hourglass Diffusion Transformer (HDiT), a hierarchical pixel-space diffusion backbone whose compute scales linearly with pixel count and which generates high-resolution images directly, without latent representations or multiscale architectures. HDiT combines a multi-level hourglass structure, 2D RoPE positional encoding, global and neighborhood attention, GEGLU feedforward layers, and a learnable skip-merging mechanism to achieve $O(n)$ computational scaling, allowing it to generate megapixel images directly in pixel space. Ablation studies validate the architecture choices. HDiT is competitive with existing models on ImageNet-$256^2$ and sets a new state of the art for diffusion models on FFHQ-$1024^2$. The work suggests HDiT as a foundation for efficient high-resolution generation, with potential extensions to latent diffusion, super-resolution, and other modalities.

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high-resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$.
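To make the linear-scaling claim concrete, the following minimal sketch counts query-key pairs for global self-attention versus a local (neighborhood) attention window as resolution grows. It is an illustration under stated assumptions, not the authors' cost model: the function names are hypothetical, the kernel size of 7 is assumed, and the patch size of 4 is taken from the Figure 1 caption.

```python
def num_tokens(height, width, patch_size):
    """Number of tokens after splitting an image into square patches."""
    return (height // patch_size) * (width // patch_size)


def global_attention_pairs(n):
    """Global self-attention compares every token with every token: O(n^2)."""
    return n * n


def neighborhood_attention_pairs(n, kernel_size):
    """Each token attends to a fixed kernel_size x kernel_size window: O(n)."""
    return n * kernel_size ** 2


if __name__ == "__main__":
    patch_size = 4    # patch size from the Figure 1 caption
    kernel_size = 7   # assumed window size, for illustration only
    for res in (256, 512, 1024):
        n = num_tokens(res, res, patch_size)
        print(
            f"{res}x{res}: tokens={n}, "
            f"global={global_attention_pairs(n)}, "
            f"neighborhood={neighborhood_attention_pairs(n, kernel_size)}"
        )
```

Going from $256^2$ to $1024^2$ multiplies the pixel count by 16. Under these assumptions the neighborhood pair count also grows by 16, while the global pair count grows by 256.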

Paper Structure (26 sections, 14 equations, 11 figures, 6 tables)

Introduction
Related Work
Transformers
High-Resolution Image Synthesis with Diffusion Models
Preliminaries
Diffusion Models
Hourglass Diffusion Transformers
Leveraging the Hierarchical Nature of Images
Hourglass Diffusion Transformer Block Design
Efficient Scaling to High Resolutions
Experiments
Experimental Setup
Effect of the Architecture
High-Resolution Pixel-Space Image Synthesis
Large-Scale ImageNet Image Synthesis
...and 11 more sections

Figures (11)

Figure 1: High-level overview of our HDiT architecture, specifically the version for ImageNet at input resolutions of $256^2$ at patch size $p = 4$, which has three levels. For any doubling in target resolution, another neighborhood attention block is added. "lerp" denotes a linear interpolation with learnable interpolation weight. All HDiT blocks have the noise level and the conditioning (embedded jointly using a mapping network) as additional inputs.
Figure 2: A comparison of our transformer block architecture and that used by DiT (Peebles & Xie, 2023).
Figure 3: A comparison of our pointwise feedforward block architecture and that used by DiT (Peebles & Xie, 2023).
Figure 4: Samples from our 85M-parameter FFHQ-$1024^2$ model. Best viewed zoomed in.
Figure 5: Samples from our class-conditional 557M-parameter ImageNet-$256^2$ model without classifier-free guidance.
...and 6 more figures
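The "lerp" skip merge described in the Figure 1 caption can be sketched as a linear interpolation between the skip-connection features and the upsampled features, controlled by a single weight. This is a minimal illustration in plain Python on flat lists. The function name, the argument order, and the choice of which input acts as the interpolation start are assumptions; in a real model the weight would be a learnable parameter updated by the optimizer rather than a fixed number.

```python
def lerp(skip, upsampled, weight):
    """Linearly interpolate from `skip` toward `upsampled`.

    weight = 0 returns the skip features unchanged, weight = 1 returns the
    upsampled features. The argument roles are an assumption for illustration.
    """
    return [s + weight * (u - s) for s, u in zip(skip, upsampled)]


if __name__ == "__main__":
    skip_features = [0.0, 2.0]
    upsampled_features = [4.0, 6.0]
    # A weight of 0.5 blends both branches equally.
    print(lerp(skip_features, upsampled_features, 0.5))
```

Compared with concatenating the two branches, an interpolation keeps the channel count unchanged, so no extra projection is needed after the merge.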
