Score Distillation of Flow Matching Models

Mingyuan Zhou, Yi Gu, Huangjie Zheng, Liangchen Song, Guande He, Yizhe Zhang, Wenze Hu, Yinfei Yang

TL;DR

The paper addresses slow diffusion sampling by extending score distillation to flow matching models. It first presents a unified Gaussian perspective that links diffusion and flow matching: it derives Tweedie's formula to show that the x0-, ε-, and v-prediction targets are equivalent, identifies loss weighting as the practical differentiator, and avoids ODE/SDE formulations. Building on this, it extends Score identity Distillation (SiD) to DiT-based flow-matching models and demonstrates data-free and data-aided distillation across SANA, SD3/SD3.5, and FLUX.1-dev, producing four-step generators from a single codebase without teacher finetuning. The results establish a general framework for accelerating flow- and diffusion-based text-to-image generation and bridge theoretical gaps between the two paradigms.

Abstract

Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. A project page is available at https://yigu1008.github.io/SiD-DiT.
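The equivalence of prediction targets mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes the rectified-flow convention x_t = (1 - t) x0 + t ε with velocity v = ε - x0; the sign convention and all function names are illustrative assumptions, not taken from the paper's code. Because the conversions are linear in v for a given x_t and t, the same identities carry over to the conditional expectations given x_t.

```python
# Sketch under stated assumptions: alpha_t = 1 - t, sigma_t = t, v = eps - x0.
# Function names are hypothetical and chosen for illustration only.

def interpolate(x0: float, eps: float, t: float) -> float:
    """Noisy sample x_t = (1 - t) * x0 + t * eps."""
    return (1.0 - t) * x0 + t * eps

def velocity(x0: float, eps: float) -> float:
    """Velocity target under the assumed convention v = eps - x0."""
    return eps - x0

def x0_from_v(xt: float, v: float, t: float) -> float:
    """Recover the x0 target from a v target: x0 = x_t - t * v."""
    return xt - t * v

def eps_from_v(xt: float, v: float, t: float) -> float:
    """Recover the eps target from a v target: eps = x_t + (1 - t) * v."""
    return xt + (1.0 - t) * v

# Round trip on a single scalar sample.
x0, eps, t = 1.5, -0.7, 0.3
xt = interpolate(x0, eps, t)
v = velocity(x0, eps)
x0_rec = x0_from_v(xt, v, t)
eps_rec = eps_from_v(xt, v, t)
```

Since any one of the three targets determines the other two once x_t and t are known, a network trained on one target differs from the others only through the effective loss weighting over t.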

Paper Structure

This paper contains 20 sections, 28 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Qualitative results produced by the four-step SiD-DiT generator distilled from SD3.5-Large.
  • Figure 2: The first row shows density plots of various noise schedules mapped to $t \in (0, 1)$ by aligning their signal-to-noise ratio (SNR), $\text{SNR}_t = \alpha_t^2 / \sigma_t^2$, with $(1 - t)^2 / t^2$, which corresponds to setting $t = 1 / (1 + \sqrt{\text{SNR}_t})$. The remaining rows show the weight-normalized distribution of $t$ under different weighting schemes: $1/t$, $1 - t$, $(1 - t)^2$, $(1 - t)/t$, and $(1 - t)^2/t^2$. The first column corresponds to the default schedule used in this paper and in TrigFlow training of SANA-Sprint; the second to the default TrigFlow schedule; the third to the discretized schedule of SANA; the fourth to the DDPM beta linear schedule; the fifth to EDM's training schedule; and the sixth to EDM's sampling schedule restricted to $t < 0.8$, as in SiD for score distillation.
  • Figure 3: Comparison of generators distilled from Sana_600M_512px_diffusers with $t$ restricted to different ranges. The text prompts are: 'a dog and a cat laying on the red carpet on the floor.', 'an old blue car with a surfboard on top', 'a lady is about to put an automatic tooth brush in her mouth', and 'a good luck plant is in a round vase.'
  • Figure 4: This plot shows the evolution of FID (solid lines, left y-axis) and CLIP score (matching line styles with reduced opacity, right y-axis) as a function of the number of iterated images (in thousands) for SiD-DiT. Because the x-axis is log-scaled, the near-linear trends in many panels reflect a rapid initial decline in FID accompanied by a corresponding rise in CLIP score, followed by progressively smaller gains as training continues. This consistent behavior across architectures and model sizes shows that SiD-DiT quickly improves both image fidelity and semantic alignment during the early stages of distillation.
  • Figure 5: Qualitative results produced by the four-step SiD-DiT generator distilled from FLUX.1-dev.
  • ...and 8 more figures
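
The SNR alignment described in the Figure 2 caption can be sketched as follows. The snippet maps a signal-to-noise ratio $\text{SNR}_t = \alpha_t^2 / \sigma_t^2$ to $t \in (0, 1)$ by solving $(1 - t)^2 / t^2 = \text{SNR}_t$, which gives $t = 1 / (1 + \sqrt{\text{SNR}_t})$. The function names are hypothetical and only illustrate the formula; they are not the authors' implementation.

```python
import math

def snr_to_t(snr: float) -> float:
    """Map SNR = alpha^2 / sigma^2 to t in (0, 1) via (1 - t)^2 / t^2 = SNR."""
    if snr <= 0.0:
        raise ValueError("SNR must be positive")
    return 1.0 / (1.0 + math.sqrt(snr))

def t_to_snr(t: float) -> float:
    """Inverse map: SNR implied by t when alpha_t = 1 - t and sigma_t = t."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    return (1.0 - t) ** 2 / t ** 2

def schedule_to_t(alpha: float, sigma: float) -> float:
    """Align an arbitrary (alpha, sigma) pair from another schedule with t."""
    return snr_to_t((alpha / sigma) ** 2)

# Equal signal and noise (SNR = 1) lands at the midpoint of the t axis.
t_mid = snr_to_t(1.0)
# Round trip: t -> SNR -> t.
t_back = snr_to_t(t_to_snr(0.3))
```

For example, a schedule with $\alpha = 1$ and noise level $\sigma$ maps to $t = \sigma / (1 + \sigma)$, so different noise schedules can be compared on a common $t$ axis, as in the first row of Figure 2.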