Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, Marc Aubreville

TL;DR

Würstchen tackles the high computational demands of large-scale text-to-image diffusion by introducing a three-stage latent-diffusion architecture with a 42:1 semantic latent compression. By decoupling text-conditioned generation from high-resolution rendering and training in reverse, it achieves competitive fidelity at roughly an 8x reduction in training cost compared with SD2.1, while also improving inference speed. The approach is validated through automated metrics, human preferences, and efficiency analyses on COCO, Localized Narratives, and Parti-prompts, and it is released as open-source. This work highlights a path toward more accessible, sustainable, and scalable diffusion-based image synthesis without sacrificing perceptual quality.

Abstract

We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance than latent representations of language, and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consist of 24,602 A100-GPU hours, compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allow us to perform inference over twice as fast, significantly reducing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model without compromising the end performance. In a broader comparison against SOTA models, our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.
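The "roughly 8x" training-cost reduction quoted in the TL;DR can be checked directly from the two GPU-hour figures given in the abstract. The short sketch below only redoes that arithmetic; the constant and function names are illustrative and do not come from the paper.

```python
# GPU-hour figures as quoted in the abstract.
WUERSTCHEN_GPU_HOURS = 24_602   # A100-GPU hours reported for Würstchen
SD21_GPU_HOURS = 200_000        # GPU hours reported for Stable Diffusion 2.1


def cost_reduction(baseline_hours: float, model_hours: float) -> float:
    """Return how many times cheaper the model is than the baseline."""
    return baseline_hours / model_hours


reduction = cost_reduction(SD21_GPU_HOURS, WUERSTCHEN_GPU_HOURS)
saved_fraction = 1 - WUERSTCHEN_GPU_HOURS / SD21_GPU_HOURS

print(f"training cost reduction: {reduction:.2f}x")
print(f"GPU hours saved: {saved_fraction:.1%}")
```

The ratio comes out slightly above 8, which matches the "roughly 8x" wording.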

Paper Structure

This paper contains 40 sections, 10 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Text-conditional generations using Würstchen. Note the various art styles and aspect ratios.
  • Figure 2: Inference architecture for text-conditional image generation.
  • Figure 3: Training objectives of our model. Initially, a VQGAN is trained. Secondly, Stage B is trained as a diffusion model inside Stage A's latent space. Stage B is conditioned on text-embeddings and the output of the Semantic Compressor, which produces strongly downsampled latent representations of the same image. Finally, Stage C is trained on the latents of the Semantic Compressor as a text-conditional LDM, effectively operating on a compression ratio of $42:1$.
  • Figure 4: Inference time for $1024\times 1024$ images on an A100 GPU. The left plot shows performance without specific optimization; the right plot shows performance using torch.compile().
  • Figure 5: Overall human preferences (left) and preferences by user (middle). The right plot shows preferences by user, restricted to users with a large number of comparisons.
  • ...and 12 more figures
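
The $42:1$ compression ratio in the Figure 3 caption can be read as a per-side spatial factor. As a minimal sketch, assume a $1024\times 1024$ input image (the resolution used in Figure 4) and a $24\times 24$ Stage C latent grid; the latent grid size is an assumption not stated in this section, and the names below are illustrative.

```python
# Assumed shapes: the 24x24 latent grid is NOT given in this section.
IMAGE_SIDE = 1024    # pixels per side (resolution used in Figure 4)
LATENT_SIDE = 24     # assumed Stage C latent positions per side


def spatial_compression(image_side: int, latent_side: int) -> float:
    """Per-side spatial compression factor between image and latent grid."""
    return image_side / latent_side


ratio = spatial_compression(IMAGE_SIDE, LATENT_SIDE)
positions_ratio = ratio ** 2  # reduction in the number of spatial positions

print(f"per-side compression: {ratio:.1f}:1")
print(f"spatial positions reduced by a factor of about {positions_ratio:.0f}")
```

Under these assumptions the per-side factor lands between 42 and 43, consistent with the $42:1$ figure. Because the text-conditional diffusion model works on this small grid, each denoising step touches far fewer spatial positions than it would in pixel space.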