On the Scalability of Diffusion-based Text-to-Image Generation
Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto
TL;DR
This work systematically analyzes how diffusion-based text-to-image models scale along two axes: denoising backbones (UNets vs. transformers) and training data. Using fair, controlled experiments, it shows that SDXL-style UNets with carefully allocated cross-attention at low resolutions can outperform alternatives while remaining efficient, and that adding transformer blocks improves text-image alignment more parameter-efficiently than widening channels. Data quality and caption density, especially when augmented with synthetic captions, significantly improve both image fidelity and alignment, and combined datasets accelerate learning more than any single source. The study provides scaling functions that link text-image alignment to compute, model size, and dataset size, offering practical guidance for building more scalable, cost-effective diffusion-based T2I systems. Overall, the findings help define when to invest in model architecture versus data to push the Pareto frontier of image quality and efficiency.
Abstract
Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for diffusion-based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. Differing training settings and expensive training costs make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion-based T2I models by performing extensive and rigorous ablations on scaling both denoising backbones and the training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images. For model scaling, we find that the location and amount of cross-attention distinguish the performance of existing UNet designs, and that increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing the number of channels. We then identify an efficient UNet variant that is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side, we show that the quality and diversity of the training set matter more than dataset size alone. Increasing caption density and diversity improves text-image alignment performance and learning efficiency. Finally, we provide scaling functions to predict text-image alignment performance as functions of model size, compute, and dataset size.
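
The abstract does not state the form of the scaling functions. As a minimal sketch, assuming a simple power law score ≈ a · C^b (an illustrative assumption, not necessarily the form fitted in the paper), such a function can be estimated by linear regression in log-log space. The helper names `fit_power_law` and `predict` are hypothetical, and the data points are synthetic values generated from made-up coefficients, not results from the paper.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on log-transformed data."""
    # In log space the model is linear: log(y) = log(a) + b * log(x).
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)


def predict(x, a, b):
    """Evaluate the fitted power law at new values of x."""
    return a * np.asarray(x, dtype=float) ** b


# Synthetic, hypothetical data: compute budgets and an alignment score
# generated from made-up coefficients (NOT numbers from the paper).
compute = np.array([1e18, 1e19, 1e20, 1e21])
true_a, true_b = 0.05, 0.06
score = true_a * compute ** true_b

a, b = fit_power_law(compute, score)
extrapolated = predict(1e22, a, b)
print(f"fitted a={a:.4f}, b={b:.4f}, predicted score at 1e22: {extrapolated:.4f}")
```

The same fit applies unchanged when the independent variable is model size or dataset size instead of compute. With real measurements, the residuals in log space give a quick check on whether a single power law is an adequate description.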
