Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

Vikash Sehwag, Xianghao Kong, Jingtao Li, Michael Spranger, Lingjuan Lyu

TL;DR

This work tackles the barrier to developing large-scale text-to-image diffusion models by introducing a micro-budget training pipeline that leverages deferred patch masking, a lightweight patch-mixer, and mixture-of-experts to dramatically reduce compute. By pre-processing patch embeddings before masking and coupling this with layer-wise scaling and MoE, the authors train a 1.16B sparse diffusion transformer on 37M images for ~\$1,890, achieving a zero-shot COCO FID of 12.7 and substantially lower costs than prior methods. They also show that incorporating synthetic data improves alignment with human preferences, with GPT-4o-based evaluations favoring combined real+synthetic data. The paper demonstrates that high-quality diffusion models can be trained on micro-budgets using open data, and the authors aim to release an end-to-end training pipeline to democratize access to large-scale diffusion modeling.

Abstract

As scaling laws in generative AI push performance, they simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to address this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose to randomly mask up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking, making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer at an economical cost of only \$1,890 and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118$\times$ lower cost than Stable Diffusion models and 14$\times$ lower cost than the current state-of-the-art approach that costs \$28,400. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
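The deferred masking idea described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: `patch_mixer` here is a hypothetical stand-in (a simple blend with the global mean) for the lightweight transformer used in the paper, the backbone is omitted entirely, and the patch count and dimensions are arbitrary. It only shows the ordering of operations: every patch is mixed first, a random subset is kept afterwards, and the loss is computed on the kept patches only.

```python
import random


def patch_mixer(patches):
    """Hypothetical stand-in for the patch-mixer: blend each patch with the global mean.

    The point is only that this step sees ALL patches before any are dropped.
    """
    dim = len(patches[0])
    mean = [sum(p[d] for p in patches) / len(patches) for d in range(dim)]
    return [[0.5 * p[d] + 0.5 * mean[d] for d in range(dim)] for p in patches]


def deferred_mask(patches, mask_ratio, rng):
    """Mix all patches, then keep a random (1 - mask_ratio) fraction for the backbone."""
    mixed = patch_mixer(patches)
    num_keep = max(1, round(len(mixed) * (1.0 - mask_ratio)))
    keep_idx = sorted(rng.sample(range(len(mixed)), num_keep))
    return keep_idx, [mixed[i] for i in keep_idx]


def masked_mse(pred, target, keep_idx):
    """Mean squared error over the kept (unmasked) patches only."""
    total, count = 0.0, 0
    for p, i in zip(pred, keep_idx):
        for a, b in zip(p, target[i]):
            total += (a - b) ** 2
            count += 1
    return total / count


rng = random.Random(0)
patches = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]  # 64 toy patches
keep_idx, kept = deferred_mask(patches, 0.75, rng)
print(len(patches), len(kept))  # 64 16
```

With a 75% masking ratio only a quarter of the patches reach the (omitted) backbone, which is where the compute saving comes from; the mixer itself still runs on the full sequence, so it has to stay lightweight.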

Paper Structure

This paper contains 22 sections, 6 equations, 27 figures, 6 tables.

Figures (27)

  • Figure 1: Qualitative evaluation of the image generation capabilities of our model ($512\times512$ image resolution). Our model is trained in $2.6$ days on a single 8$\times$H100 machine (amounting to only \$1,890 in GPU cost) without any proprietary or billion-image dataset. In (a)-(c) we examine diverse style generation capabilities using the prompt 'Image of an astronaut riding a horse in [style] style', with the following styles: Origami, Pixel art, Line art, Cyberpunk, and Van Gogh Starry Night. In (d) we compare the training cost and fidelity, measured using FID (Heusel et al., 2017) for zero-shot image generation on the COCO dataset (Lin et al., 2014), of all models.
  • Figure 2: Compressing patch sequence to reduce computational cost. As the training cost of diffusion transformers is proportional to sequence size, i.e., number of patches, it is desirable to reduce the sequence size without degrading performance. It can be achieved by a) using larger patches, b) naively masking a fraction of patches at random, or c) using MaskDiT (Zheng et al., 2024), which combines naive masking with an additional autoencoding objective. We find all three approaches lead to significant degradation in image generation performance, especially at high masking ratios. To alleviate this issue, we propose a straightforward deferred masking strategy, where we mask patches after they are processed by a patch-mixer. Our approach is analogous to naive masking in all aspects except the use of the patch-mixer. In comparison to MaskDiT, our approach doesn't require optimizing any surrogate objectives and has nearly identical computational costs.
  • Figure 3: Overall architecture of our diffusion transformer. We prepend the backbone transformer model with a lightweight patch-mixer that operates on all patches in the input image before they are masked. Following contemporary works (Betker et al., 2023; Esser et al., 2024), we process the caption embeddings using an attention layer before using them for conditioning. We use sinusoidal embeddings for timesteps. Our model only denoises unmasked patches, thus the diffusion loss is calculated only on these patches. We modify the backbone transformer using layer-wise scaling on individual layers and use mixture-of-experts layers in alternate transformer blocks.
  • Figure 4: Out-of-the-box performance of deferred masking. Without any hyperparameter optimization, we compare the performance of our deferred masking with a naive masking strategy. We find that deferred masking, i.e., using a patch-mixer before naive masking, tremendously improves image generation performance, particularly at high masking ratios.
  • Figure 5: Comparing performance of patch masking strategies. Using a lightweight patch-mixer before patch masking in our deferred masking approach significantly improves image generation performance over baseline masking strategies. Our approach incurs nearly identical training cost to the MaskDiT (Zheng et al., 2024) baseline. However, both approaches incur slightly higher cost than naive masking due to the use of an additional lightweight transformer along with the backbone diffusion transformer.
  • ...and 22 more figures
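
As a rough sanity check on the numbers quoted in the Figure 1 caption (2.6 days on a single 8$\times$H100 machine, \$1,890 in GPU cost), the implied hourly rates can be worked out directly. The sketch below only rearranges those quoted figures; the function name is illustrative, and because the 2.6-day duration is rounded, the resulting rates are approximate rather than the pricing the authors actually assumed.

```python
def implied_rates(days, gpus, total_cost_usd):
    """Return (machine_hours, gpu_hours, usd_per_machine_hour, usd_per_gpu_hour)."""
    machine_hours = days * 24          # wall-clock hours on the whole machine
    gpu_hours = machine_hours * gpus   # summed over all GPUs
    return (
        machine_hours,
        gpu_hours,
        total_cost_usd / machine_hours,
        total_cost_usd / gpu_hours,
    )


machine_hours, gpu_hours, per_machine, per_gpu = implied_rates(2.6, 8, 1890.0)
print(round(machine_hours, 1), round(gpu_hours, 1))  # 62.4 499.2
print(round(per_machine, 2), round(per_gpu, 2))      # 30.29 3.79
```

Under these figures the training run corresponds to roughly 500 H100 GPU-hours at a little under \$4 per GPU-hour.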