On the Scalability of Diffusion-based Text-to-Image Generation

Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto

TL;DR

This work systematically analyzes how diffusion-based text-to-image models scale along two axes: denoising backbones (UNets vs transformers) and training data. Using fair, controlled experiments, it shows that SDXL-style UNets with carefully allocated cross-attention at low resolutions can outperform alternatives while offering strong efficiency; deeper transformer blocks also boost text-image alignment more parameter-efficiently than simply widening channels. Data quality and caption density—especially when augmented with synthetic captions—significantly improve both image fidelity and alignment, and combined datasets accelerate learning more than any single source. The study provides scaling laws linking compute, model size, and data, offering practical guidance for building more scalable, cost-effective diffusion-based T2I systems. Overall, the findings help define when and how to invest in model architecture versus data to push the Pareto frontier of image quality and efficiency.

Abstract

Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for diffusion-based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. Differing training settings and expensive training costs make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion-based T2I models by performing extensive and rigorous ablations on scaling both denoising backbones and the training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images. For model scaling, we find that the location and amount of cross-attention distinguish the performance of existing UNet designs, and that increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing the number of channels. We then identify an efficient UNet variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side, we show that the quality and diversity of the training set matter more than dataset size alone. Increasing caption density and diversity improves text-image alignment performance and learning efficiency. Finally, we provide scaling functions that predict text-image alignment performance as a function of model size, compute, and dataset size.

Paper Structure

This paper contains 49 sections, 18 figures, and 4 tables.

Figures (18)

  • Figure 1: Pushing the Pareto frontier of the text-image alignment learning curve by efficiently scaling up both denoising backbones and training data. Compared with the baseline SD2 UNet, combined scaling with both the SDXL UNet and an enlarged dataset significantly improves performance and speeds up the convergence of the TIFA score by 6$\times$.
  • Figure 2: Comparison of the UNet design between SD2 (left) and SDXL (right). SD2 applies cross-attention at all down-sampling levels, including 1$\times$, 2$\times$, 4$\times$ and 8$\times$, while SDXL adopts cross-attention only at 2$\times$ and 4$\times$ down-sampling levels.
  • Figure 3: Evolution of the TIFA score during training with different UNets on the same dataset, plotted against training steps and training compute (GFLOPs). Compute is estimated as 3$\times$ (FLOPs of a single DDPM step) $\times$ batch size $\times$ steps.
  • Figure 4: Evolution of the TIFA score during training with scaled UNet variations. The baseline models are the UNets of SD2 and SDXL. We train SDXL UNet variants with changes in (a) channels $C$, (b) transformer depth (TD), and (c) both channels and TD.
  • Figure 5: Visualizing the effect of UNet scaling on text-image alignment. We change the UNet along two dimensions: channel number (left) and transformer depth (right). The prompts are: 1) "square blue apples on a tree with circular yellow leaves" 2) "five frosted glass bottles" 3) "a yellow box to the right of a blue sphere" 4) "the International Space Station flying in front of the moon"
  • ...and 13 more figures
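
The training-compute estimate described in the Figure 3 caption (3 $\times$ FLOPs of a single DDPM step $\times$ batch size $\times$ steps) can be sketched as a small helper. This is only an illustration of that arithmetic, not the authors' code: the function names and every numeric value below are hypothetical placeholders, and the reading of the factor 3 as "one forward pass plus a backward pass costing roughly twice as much" is a common convention assumed here rather than something stated in the caption.

```python
def estimate_training_flops(flops_per_ddpm_step: float,
                            batch_size: int,
                            steps: int,
                            multiplier: float = 3.0) -> float:
    """Estimate total training compute in FLOPs.

    Follows the Figure 3 caption:
        compute = 3 x (FLOPs of a single DDPM step) x batch size x steps

    `multiplier` defaults to the caption's factor of 3 (assumed here to
    cover the forward pass plus the backward pass).
    """
    if flops_per_ddpm_step < 0 or batch_size < 0 or steps < 0:
        raise ValueError("all inputs must be non-negative")
    return multiplier * flops_per_ddpm_step * batch_size * steps


def to_gflops(flops: float) -> float:
    """Convert FLOPs to GFLOPs (1 GFLOP = 1e9 FLOPs)."""
    return flops / 1e9


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the calculation;
    # they are NOT taken from the paper.
    per_step = 1.0e12   # FLOPs for one DDPM step on one sample
    batch = 2048        # samples per training step
    n_steps = 250_000   # number of training steps

    total = estimate_training_flops(per_step, batch, n_steps)
    print(f"estimated compute: {to_gflops(total):.3e} GFLOPs")
```

Because the estimate is linear in each factor, halving the per-step FLOPs of the backbone halves the compute needed to reach a given step count, which is why the caption reports curves against compute as well as against steps.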