SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

Trung Dao, Thuan Hoang Nguyen, Thanh Le, Duc Vu, Khoi Nguyen, Cuong Pham, Anh Tran

TL;DR

SwiftBrush v2 aims to make a one-step diffusion student surpass its multi-step teacher. It analyzes the quality-diversity trade-off, initializes the student with SD Turbo weights, and adds a clamped CLIP loss, together with training that scales with data and remains resource-efficient. By fusing two training schemes through simple weight interpolation and applying post-training image regularization, the method achieves state-of-the-art one-step FID on COCO-2014 (FID = 8.14) while maintaining near real-time inference. The approach demonstrates strong image quality, textual alignment, and diversity, outperforming GAN-based and prior one-step methods, and is supported by ablations and practical training strategies (LoRA, TinyVAE, ScaleCrafter). The work also provides insights into robustness, compositional improvements, and scalable data utilization, offering a practical path toward accessible, high-quality on-device text-to-image synthesis.

Abstract

In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart. Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modifications in the training methodology, including better weight initialization and efficient LoRA training. Moreover, our introduction of a novel clamped CLIP loss enhances image-text alignment and results in improved image quality. Remarkably, by combining the weights of models trained with efficient LoRA and full training, we achieve a new state-of-the-art one-step diffusion model, achieving an FID of 8.14 and surpassing all GAN-based and multi-step Stable Diffusion models. The project page is available at https://swiftbrushv2.github.io.
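The abstract names a clamped CLIP loss but does not give its formula here. As a rough illustration only, one plausible reading is a hinge on image-text cosine similarity: the loss is positive while the similarity is below a threshold and drops to zero once the threshold is reached, so the alignment term stops pushing once the image matches the text well enough. The function names, the threshold `tau`, and the hinge form itself are assumptions made for this sketch, not the authors' definition, and plain Python lists stand in for CLIP embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clamped_clip_loss(image_emb, text_emb, tau):
    """Hypothetical clamped alignment loss (assumed hinge form).

    Returns tau - similarity while the similarity is below tau,
    and exactly 0.0 once the similarity reaches or exceeds tau.
    """
    return max(0.0, tau - cosine_similarity(image_emb, text_emb))


# Toy embeddings: orthogonal vectors are penalized, aligned ones are not.
loss_misaligned = clamped_clip_loss([1.0, 0.0], [0.0, 1.0], tau=0.5)
loss_aligned = clamped_clip_loss([1.0, 0.0], [1.0, 0.0], tau=0.5)
```

In a real training loop the embeddings would come from a CLIP image encoder and text encoder, and this term would be added to the distillation loss with some weight; both details are outside what this page states.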

Paper Structure (29 sections, 6 equations, 19 figures, 9 tables)

Figures (19)

  • Figure 1: Our one-step diffusion model achieves an impressive FID of 8.14, generating high-quality and diverse results with a single UNet forwarding. The example images generated from the "A laughing cute grey rabbit with white stripe on the head, piles of gold coins in background, colorful, Disney Picture render, photorealistic" (first two rows) and "Portrait of a woman looking at the camera" (last two rows) prompts demonstrate our model's ability to create fast, visually appealing, and varied outputs.
  • Figure 2: SwiftBrush v2 overview: two versions of the student model: a fully finetuned model trained with the Variational Score Distillation (VSD) loss, and a LoRA finetuned model trained with both VSD and CLIP loss. The final model is obtained by merging the two student models, leveraging the strengths of both training schemes.
  • Figure 3: The effect of weight interpolation upon FID, CLIP score, precision, and recall calculated on the zero-shot MS COCO-2014 benchmark. 0.0 indicates SD Turbo, and 1.0 indicates the original SwiftBrush.
  • Figure 4: User survey. We asked participants to compare the quality and diversity of images generated by our method and its teacher model across 20 random text prompts.
  • Figure 5: Exemplified images generated by SD Turbo, SwiftBrush, SDv2.1 with 50 sampling steps, InstaFlow-0.9B and Ours. Images in the same row are sampled from the same text prompt, while images in the same column are from the same model.
  • ...and 14 more figures
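
Figures 2 and 3 both rely on merging two models in weight space. A minimal sketch of linear weight interpolation follows, using the convention from the Figure 3 caption, where a coefficient of 0.0 returns the first model and 1.0 returns the second. The function name, the coefficient name `lam`, and the flat dictionary of float lists are assumptions for illustration; real checkpoints would hold tensors, and the exact merging procedure used for the final model is not specified on this page.

```python
def interpolate_weights(weights_a, weights_b, lam):
    """Linearly interpolate two models that share the same parameter names.

    lam = 0.0 returns weights_a, lam = 1.0 returns weights_b.
    Each parameter is a flat list of floats (a stand-in for a tensor).
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both models must have the same parameter names")
    merged = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * x + lam * y for x, y in zip(a, b)]
    return merged


# Toy example with a single two-element parameter per model.
model_a = {"w": [0.0, 2.0]}
model_b = {"w": [4.0, 6.0]}
halfway = interpolate_weights(model_a, model_b, 0.5)
```

Sweeping `lam` over a grid between 0.0 and 1.0 and evaluating each merged model is how a curve like the one described in Figure 3 could be produced.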