Score Distillation of Flow Matching Models
Mingyuan Zhou, Yi Gu, Huangjie Zheng, Liangchen Song, Guande He, Yizhe Zhang, Wenze Hu, Yinfei Yang
TL;DR
The paper addresses slow diffusion sampling by presenting a unified Gaussian-based perspective that links diffusion and flow matching without relying on ODE/SDE formulations. It derives Tweedie's formula to show that the common prediction targets (x0-, ε-, and v-prediction) are equivalent, and it identifies loss weighting as the practical differentiator between them. Building on this, it extends Score identity Distillation (SiD) to DiT-based flow-matching models and demonstrates data-free and data-aided distillation across SANA, SD3/SD3.5, and FLUX.1-dev, producing four-step generators from a single codebase without teacher finetuning. The results establish a robust, general framework for accelerating flow- and diffusion-based text-to-image generation and bridge theoretical gaps between the two paradigms.
Abstract
Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation, based on Bayes' rule and conditional expectations, that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow-matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. A project page is available at https://yigu1008.github.io/SiD-DiT.
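The claimed equivalence between prediction targets can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes the linear interpolation x_t = (1 - t) x0 + t ε with velocity v = ε - x0 (a common rectified-flow convention), and a one-dimensional Gaussian data distribution chosen only because its conditional expectations are available in closed form. The function names and the default values of mu and s are hypothetical.

```python
import math


def posterior_means(x_t, t, mu=0.5, s=2.0):
    """Closed-form E[x0 | x_t], E[eps | x_t], and score for toy 1-D data.

    Assumed toy setup (not from the paper): x0 ~ N(mu, s^2), eps ~ N(0, 1),
    and x_t = (1 - t) * x0 + t * eps.
    """
    a, b = 1.0 - t, t                      # signal and noise coefficients
    var_t = a * a * s * s + b * b          # variance of the marginal p(x_t)
    resid = x_t - a * mu                   # deviation from the marginal mean
    x0_hat = mu + a * s * s * resid / var_t
    eps_hat = b * resid / var_t
    score = -resid / var_t                 # d/dx_t of log p(x_t)
    return x0_hat, eps_hat, score


def check(x_t=1.3, t=0.4):
    """Recover x0- and eps-predictions from the v-prediction and the score."""
    x0_hat, eps_hat, score = posterior_means(x_t, t)
    v_hat = eps_hat - x0_hat               # E[v | x_t] by linearity
    a, b = 1.0 - t, t
    return {
        "x0_hat": x0_hat,
        "eps_hat": eps_hat,
        "x0_from_v": x_t - t * v_hat,
        "eps_from_v": x_t + (1.0 - t) * v_hat,
        "x0_from_score": (x_t + b * b * score) / a,   # Tweedie's formula
        "eps_from_score": -b * score,
    }


if __name__ == "__main__":
    for name, value in check().items():
        print(f"{name}: {value:.6f}")
```

Under these assumptions the three routes to each quantity agree to floating-point precision, which is the sense in which the targets carry the same information and differ only in how a training loss weights them.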
