SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation

Xingtong Ge, Xin Zhang, Tongda Xu, Yi Zhang, Xinjie Zhang, Yan Wang, Jun Zhang

TL;DR

To overcome the scalability challenge of DMD on large flow-based models, implicit distribution alignment (IDA) constrains the divergence between the generator and the fake distribution, and intra-segment guidance (ISG) relocates the timestep denoising importance from the teacher model.

Abstract

Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX. In this paper, we first analyze the issues that arise when applying vanilla DMD to large-scale models. Then, to overcome the scalability challenge, we propose implicit distribution alignment (IDA) to constrain the divergence between the generator and the fake distribution. Furthermore, we propose intra-segment guidance (ISG) to relocate the timestep denoising importance from the teacher model. With IDA alone, DMD converges for SD 3.5; with both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Together with a scaled VFM-based discriminator, our final model, dubbed SenseFlow, achieves superior distillation performance for both diffusion-based text-to-image models such as SDXL and flow-matching models such as SD 3.5 Large and FLUX.1 dev. The source code is available at https://github.com/XingtongGe/SenseFlow

Paper Structure

This paper contains 42 sections, 6 theorems, 40 equations, 18 figures, 11 tables, 1 algorithm.

Key Result

Proposition 3.1

Under mild assumptions (Assumptions ass:reg and ass:fisher), IDA maintains an $\epsilon$-best inner response. More specifically, after $k$ rounds of min-max optimization in Eq. eq:dmd_minmax2, we have

Figures (18)

  • Figure 1: 1024×1024 samples produced by our 4-step generator distilled from FLUX.1-dev. Please zoom in for details.
  • Figure 2: Left: The generator $\mathcal{G}$ receives a text prompt and $x_{\tau_i}$ and produces a one-step output $x_g$, which is diffused to $x_t$ and processed by $s_{\phi}$ and $s_r$ to compute the DMD gradient. ISG guides $\mathcal{G}$ using a sampled intermediate timestep ${t_{mid}}$, and IDA aligns $\mathcal{G}$ with $s_{\phi}$ after each generator update. The overall training pipeline is shown in Algorithm 1. Right: The discriminator extracts semantic features from generated and real images using CLIP and DINOv2; these features are processed by head blocks ${h_{\theta_i}}$ to predict real/fake logits for adversarial training. Trainable modules are shown in pink, while frozen (pretrained) ones are shown in grey.
  • Figure 3: "Training Hours-FID" curves on the COCO-5K dataset. When distilling the 8B SD 3.5 Large, IDA improves training stability across TTUR ratios.
  • Figure 4: Left: Normalized reconstruction loss $\xi(t)$ over timesteps in $[0, 1]$. Right: Illustration of the Intra-Segment Guidance.
  • Figure 5: Qualitative comparisons on challenging prompts across methods. Our method shows superior fidelity, especially in rendering human faces, scene composition, and fine-grained textures.
  • ...and 13 more figures
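
The DMD gradient step described in the Figure 2 caption (diffuse $x_g$ to $x_t$, then compare $s_{\phi}$ with $s_r$) can be illustrated with a toy one-dimensional sketch. This is not the paper's implementation. It assumes closed-form Gaussian scores standing in for the fake and real networks, additive noise of scale `sigma` rather than the paper's flow-matching schedule, and no text conditioning. The names `gaussian_score` and `dmd_direction` are hypothetical.

```python
import random


def gaussian_score(x, mean, var):
    """Score (gradient of the log-density) of N(mean, var) evaluated at x."""
    return -(x - mean) / var


def dmd_direction(x_g, sigma, mu_fake, mu_real, rng):
    """Toy DMD update direction for one generator sample x_g.

    Assumption: the fake and real data distributions are unit-variance
    Gaussians, so after adding noise of scale sigma both have variance
    1 + sigma**2 and their scores are available in closed form.
    """
    # Diffuse the generator output to a noisy sample x_t.
    x_t = x_g + sigma * rng.gauss(0.0, 1.0)
    var_t = 1.0 + sigma ** 2
    s_fake = gaussian_score(x_t, mu_fake, var_t)  # stand-in for s_phi
    s_real = gaussian_score(x_t, mu_real, var_t)  # stand-in for s_r
    # Distribution-matching direction: fake score minus real score.
    return s_fake - s_real


rng = random.Random(0)
# Generator samples currently centred at 2.0; the real data is centred at 0.0.
samples = [2.0 + rng.gauss(0.0, 1.0) for _ in range(8)]
directions = [dmd_direction(x, 1.0, 2.0, 0.0, rng) for x in samples]
# One descent step on the generator outputs moves them toward the real mean.
updated = [x - 0.5 * d for x, d in zip(samples, directions)]
```

In this Gaussian toy case the direction reduces to `(mu_fake - mu_real) / (1 + sigma**2)` regardless of the noise draw, so every sample is pushed toward the real mean. With learned networks the direction depends on $x_t$, and its accuracy depends on how well the fake model tracks the generator, which is the quantity IDA is described as constraining.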

Theorems & Definitions (11)

  • Proposition 3.1
  • Proposition A.2: Cross-entropy decomposition and best response
  • proof
  • Lemma A.3: Field-gap bound under IDA using $e_k$
  • proof
  • Lemma A.4: Coupled one-step recursion for tracking
  • proof
  • Proposition A.5: Asymptotic bound for the tracking error
  • proof
  • Proposition A.7: From field error to $\varepsilon$-best response
  • ...and 1 more