Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

Hao Li; Shamit Lal; Zhiheng Li; Yusheng Xie; Ying Wang; Yang Zou; Orchid Majumder; R. Manmatha; Zhuowen Tu; Stefano Ermon; Stefano Soatto; Ashwin Swaminathan

TL;DR

This paper presents a large-scale, controlled comparison of diffusion backbones for text-to-image generation, spanning 0.3B to 8B parameters and datasets of up to 600M images. It shows that U-ViT, a pure self-attention backbone, scales more effectively than cross-attention DiTs and can outperform the SDXL UNet in controlled experiments, while data strategies such as long captions further improve text-image alignment. The work describes a fair experimental framework and reports that a 2.3B U-ViT achieves strong performance with significantly lower end-to-end latency than comparable UNet-based models. It also identifies data scale and the information density of captions as key drivers of alignment quality. Because U-ViT conditions on inputs through plain token concatenation, the findings suggest practical benefits for extending diffusion models to image editing and potentially other modalities. Overall, the study points to architecture design and data strategy as central to efficient scaling in diffusion-based T2I systems.

Abstract

We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets of up to 600M images. We find that U-ViT, a pure self-attention-based DiT model, provides a simpler design and scales more effectively than cross-attention-based DiT variants; its design also allows straightforward extension to extra conditions and other modalities. We find that a 2.3B U-ViT model can outperform the SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and using enhanced long captions improve text-image alignment and learning efficiency.
