Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation

Quan Dao, Hao Phung, Trung Dao, Dimitris Metaxas, Anh Tran

TL;DR

Self-Corrected Flow Distillation addresses sampling inefficiency in flow matching by integrating consistency distillation with adversarial training in latent space. The approach introduces a truncated consistency loss, a GAN-based one-step refinement, a reflow loss to align one-step and few-step trajectories, and a bidirectional consistency objective to stabilize cross-step generation. Empirical results on CelebA-HQ and zero-shot COCO demonstrate superior one-step and few-step FID scores and competitive CLIP metrics, with fast inference times. The work provides a practical pathway to real-time, consistent text-to-image and unconditional generation, and its code is publicly released.

Abstract

Flow matching has emerged as a promising framework for training generative models, demonstrating impressive empirical performance while offering relative ease of training compared to diffusion-based models. However, this method still requires numerous function evaluations in the sampling process. To address this limitation, we introduce a self-corrected flow distillation method that effectively integrates consistency models and adversarial training within the flow-matching framework. This work pioneers consistent generation quality across both few-step and one-step sampling. Our extensive experiments validate the effectiveness of our method, yielding superior results both quantitatively and qualitatively on CelebA-HQ and zero-shot benchmarks on the COCO dataset. Our implementation is released at https://github.com/VinAIResearch/SCFlow
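To make the sampling-cost issue concrete, the following is a minimal toy sketch, not the paper's method or code. It assumes a 1-D setting where noise N(0, 1) at t = 0 is transported to a hypothetical data distribution N(MU, S^2) at t = 1 along linear interpolation paths, for which the marginal velocity field has a closed form. The names `velocity`, `euler_sample`, and `mean_abs_error` and the values of `MU` and `S` are illustrative assumptions. Euler integration with a single function evaluation collapses every sample onto the mean, while more evaluations approach the exact ODE solution.

```python
import numpy as np

# Hypothetical toy target: data ~ N(MU, S^2); noise ~ N(0, 1) at t = 0.
MU, S = 2.0, 0.5


def velocity(x, t):
    """Closed-form marginal velocity for the path x_t = (1 - t) * x0 + t * x1
    with independent Gaussian x0 and x1 (an assumption of this toy setup)."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return MU + (t * S**2 - (1.0 - t)) / var_t * (x - t * MU)


def euler_sample(x0, nfe):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with `nfe` Euler steps."""
    x = np.array(x0, dtype=float)
    h = 1.0 / nfe
    for k in range(nfe):
        x = x + h * velocity(x, k * h)  # one function evaluation per step
    return x


def mean_abs_error(x0, nfe):
    """Mean gap between the Euler result and the exact ODE endpoint MU + S * x0."""
    exact = MU + S * np.asarray(x0, dtype=float)
    return float(np.mean(np.abs(euler_sample(x0, nfe) - exact)))


rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)
for nfe in (1, 4, 32):
    print(f"NFE={nfe:2d}  mean abs error={mean_abs_error(noise, nfe):.4f}")
```

In this toy case the one-step result loses all sample diversity, which loosely mirrors why a plain flow-matching model needs distillation before one-step or few-step sampling becomes usable.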

Paper Structure

This paper contains 15 sections, 9 equations, 15 figures, 6 tables, 2 algorithms.

Figures (15)

  • Figure 1: Illustration of consistent one-step and few-step image generation. Our method consistently delivers superior visual quality across different sampling steps, significantly surpassing the performance of the RectifiedFlow counterpart.
  • Figure 2: Qualitative results of our Distilled Text-to-Image diffusion model.
  • Figure 3: Overview of our Self-Corrected Flow Distillation method. All latents are shown as images for ease of reading.
  • Figure 4: Trajectories of 10-NFE Euler sampling for vanilla flow matching (teacher model) and the CD model.
  • Figure 5: Varying NFEs on CelebA-HQ. Increasing NFEs accentuates details and sharpness in generated faces without oversaturation issues.
  • ...and 10 more figures