Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim

TL;DR

This work addresses the limitation of fixed Gaussian sources in flow matching for conditional generation by proposing Condition-dependent Source Flow Matching (CSFM), which learns a condition-dependent source distribution via a source generator. The method combines a conditional Gaussian with variance regularization and a directional alignment term to stabilize learning and better exploit conditioning signals, enabling end-to-end training with the flow matching (FM) objective. Empirically, CSFM yields faster convergence in FID and CLIP scores, straighter transport paths, and superior performance relative to prior condition-aware couplings across multiple text-to-image (T2I) benchmarks and scales. The results show practical gains in training dynamics and generation quality, especially when target representations exhibit structured latent geometry.

Abstract

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.
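The ingredients named in the abstract (a conditional Gaussian source, the flow matching objective, variance regularization, and directional alignment) can be put together in a small forward-pass sketch. This is an illustration under stated assumptions, not the paper's implementation: the linear interpolation path, the variance-only penalty, the cosine form of the alignment term, the loss weights `lam_var` and `lam_align`, and all function and argument names are hypothetical choices made here.

```python
import numpy as np

def csfm_loss_sketch(x1, mu, log_var, velocity_fn, cond, rng,
                     lam_var=1.0, lam_align=1.0):
    """Forward-pass estimate of a CSFM-style objective (illustrative only).

    x1:          target samples, shape (batch, dim)
    mu, log_var: outputs of a hypothetical source generator given `cond`
    velocity_fn: hypothetical flow model, called as velocity_fn(x_t, t, cond)
    """
    batch = x1.shape[0]
    sigma = np.exp(0.5 * log_var)

    # Reparameterized sample from the condition-dependent source N(mu, sigma^2).
    x0 = mu + sigma * rng.standard_normal(mu.shape)

    # Assumed linear path between source and target, with velocity x1 - x0.
    t = rng.uniform(size=(batch, 1))
    x_t = (1.0 - t) * x0 + t * x1
    fm = np.mean(np.sum((velocity_fn(x_t, t, cond) - (x1 - x0)) ** 2, axis=1))

    # Assumed variance-only penalty: zero at sigma^2 = 1, grows without bound
    # as sigma^2 -> 0, and places no constraint on mu.
    var_reg = 0.5 * np.mean(np.sum(np.exp(log_var) - 1.0 - log_var, axis=1))

    # Assumed directional alignment: 1 - cosine similarity of mu and x1.
    norms = np.linalg.norm(mu, axis=1) * np.linalg.norm(x1, axis=1) + 1e-8
    align = np.mean(1.0 - np.sum(mu * x1, axis=1) / norms)

    total = fm + lam_var * var_reg + lam_align * align
    return {"fm": fm, "var_reg": var_reg, "align": align, "total": total}

# Toy usage with random data and a placeholder flow model that predicts zeros.
rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 4))
cond = rng.standard_normal((8, 3))
mu = cond @ (0.1 * rng.standard_normal((3, 4)))  # stand-in source generator
log_var = np.zeros((8, 4))
print(csfm_loss_sketch(x1, mu, log_var,
                       lambda x_t, t, c: np.zeros_like(x_t), cond, rng))
```

In an actual training loop the same quantities would be computed in an autodiff framework so that gradients reach both the flow model and the source generator; NumPy is used here only to keep the sketch self-contained.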

Paper Structure

This paper contains 44 sections, 19 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Condition-dependent Source Flow Matching (CSFM). Flow matching does not require the source distribution to be a fixed standard Gaussian. We leverage this flexibility by learning a condition-dependent source distribution, which reduces intrinsic variance and improves conditional flow matching performance.
  • Figure 2: Analysis of CSFM designs. We investigate the effect of the source design using two synthetic two-dimensional datasets with continuous conditions: Eight Gaussians with a polar-angle condition and Two Moons with an $x$-coordinate condition. We visualize the transport trajectories, where '$\boldsymbol{\times}$' denotes source points $X_0$ and '$\bullet$' denotes points $X_1^{\text{sampled}}$ generated by the flow model. Colors indicate the conditioning variable. (A) Fixed Standard Gaussian: Independent coupling results in entangled paths and high intrinsic variance. (B) Deterministic Mapping: The flow model with a deterministically mapped source fails to reconstruct the original target distribution. (C) Conditional Gaussian: Although the source is modeled as a condition-dependent Gaussian, its variance collapses during training, resulting in insufficient support and an inability to recover the target distribution. (D) Conditional Gaussian with Standard KL Regularization: While it prevents collapse to a deterministic mapping, the constraint on $\mu_\phi(C)$ limits the mobility of the source, yielding entangled trajectories. (E) Conditional Gaussian with Variance Regularization: Variance regularization prevents collapse while allowing the conditional mode $\mu_\phi(C)$ to move, resulting in a target-aligned source distribution and disentangled trajectories.
  • Figure 3: Flow matching loss and gradient variance ($\mathrm{Var}(\nabla_\theta \mathcal{L}_{\text{FM}})$). We compare the training dynamics of standard FM, CSFM without alignment loss, and CSFM. CSFM achieves faster loss convergence and lower gradient variance, particularly at early interpolation times near the source. Details of the measurement are provided in the appendix.
  • Figure 4: Training efficiency under different target representations. We compare FID and CLIP score trajectories between CSFM and FM, using (A) SD-VAE and (B) RAE (DINOv2) target representations on the ImageNet-1K validation set. While CSFM yields consistent gains under both representations, it substantially accelerates convergence and achieves larger improvements in the structured RAE space.
  • Figure 5: Few-step generation and flow straightness. We compare FID across different sampling steps for (A) Flow Matching and (B) 1-Reflow. CSFM degrades more gracefully as the number of steps decreases, indicating reduced path intersections and a straighter transport field compared to the FM baseline.
  • ...and 11 more figures
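
The contrast between panels (D) and (E) of Figure 2 can be read off the closed-form KL divergence of a diagonal Gaussian against the standard normal. The decomposition below is a standard identity. The symbol $\sigma_{\phi,i}^2(C)$ for the per-dimension source variance, and the reading of "variance regularization" as keeping only the second term, are assumptions made here for illustration; the paper's exact regularizer is not stated in this summary.

$$
D_{\mathrm{KL}}\!\left(\mathcal{N}\big(\mu_\phi(C), \mathrm{diag}(\sigma_\phi^2(C))\big) \,\big\|\, \mathcal{N}(0, I)\right)
= \underbrace{\tfrac{1}{2}\,\lVert \mu_\phi(C) \rVert^2}_{\text{pulls the mean toward the origin}}
+ \underbrace{\tfrac{1}{2} \sum_i \Big(\sigma_{\phi,i}^2(C) - 1 - \log \sigma_{\phi,i}^2(C)\Big)}_{\text{keeps the variance away from zero}}
$$

The second term is non-negative, vanishes at $\sigma_{\phi,i}^2(C) = 1$, and diverges as $\sigma_{\phi,i}^2(C) \to 0$, so on its own it is enough to rule out the collapse seen in panel (C). The first term is what restricts $\mu_\phi(C)$ in panel (D); dropping it leaves the mean free to move toward the target, which is consistent with the behavior described for panel (E).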