On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity

Quentin Bertrand, Anne Gagneux, Mathurin Massias, Rémi Emonet

TL;DR

The paper challenges the view that stochastic targets drive generalization in flow matching, showing that in high-dimensional data the stochasticity of the target contributes little to performance. By leveraging the closed-form optimal velocity field and introducing Empirical Flow Matching (EFM), it demonstrates that reducing target stochasticity—via a data-driven, unbiased estimator of the closed-form—yields stable or improved generalization on CIFAR-10 and CelebA. The key insight is that generalization is linked to the network's ability to approximate the closed-form velocity, with early-time dynamics playing a decisive role, rather than the presence of stochastic targets throughout training. These results suggest practical routes to more efficient and robust flow-based generative modeling, while highlighting the need to consider potential societal impacts of high-quality synthetic data.

Abstract

Modern deep generative models can now produce high-quality synthetic samples that are often indistinguishable from real training data. A growing body of research aims to understand why recent methods, such as diffusion and flow matching techniques, generalize so effectively. Among the proposed explanations are the inductive biases of deep learning architectures and the stochastic nature of the conditional flow matching loss. In this work, we rule out the noisy nature of the loss as a key factor driving generalization in flow matching. First, we empirically show that in high-dimensional settings, the stochastic and closed-form versions of the flow matching loss yield nearly equivalent losses. Then, using state-of-the-art flow matching models on standard image datasets, we demonstrate that both variants achieve comparable statistical performance, with the surprising observation that using the closed-form can even improve performance.

Paper Structure

This paper contains 25 sections, 3 theorems, 24 equations, 4 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

When $p_{\mathrm{data}}$ is replaced by $\hat{p}_{\mathrm{data}}$, under the two previous choices (items 1 and 2), the optimal velocity field $\hat{u}^\star$ given by the inversion formula has a closed form:

$$\hat{u}^\star(x, t) = \sum_{i=1}^{n} \lambda_i(x, t) \, \frac{x^{(i)} - x}{1 - t},$$

with $\lambda(x, t) = \mathrm{softmax}\left( \left( -\frac{\Vert x - t x^{(j)}\Vert^2}{2(1 - t)^2} \right)_{j=1,\ldots,n} \right) \in \mathbb{R}^n.$
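
As an illustration of Proposition 1, here is a minimal NumPy sketch of the closed-form velocity. It assumes the velocity is the softmax-weighted average of the conditional velocities $(x^{(j)} - x)/(1 - t)$, with the weights $\lambda(x, t)$ defined above. The function name `closed_form_velocity` and the toy data are illustrative and not taken from the paper's code.

```python
import numpy as np


def closed_form_velocity(x, t, data):
    """Sketch of the closed-form velocity for an empirical data distribution.

    x    : (d,) current point
    t    : scalar time, must satisfy 0 <= t < 1
    data : (n, d) training points x^(1), ..., x^(n)
    """
    # Softmax logits: -||x - t x^(j)||^2 / (2 (1 - t)^2), one per training point.
    sq_dist = np.sum((x - t * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    # Subtract the max before exponentiating for numerical stability.
    logits -= logits.max()
    weights = np.exp(logits)
    weights /= weights.sum()
    # Weighted average of the conditional velocities (x^(j) - x) / (1 - t).
    return (weights @ (data - x)) / (1.0 - t)


if __name__ == "__main__":
    # Toy data (assumption: three well-separated points in 2D).
    data = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    x0 = np.array([0.3, -0.2])
    # At t = 0 all weights are equal, so the velocity points to the data mean.
    print(closed_form_velocity(x0, 0.0, data))
    # Close to t = 1 the softmax concentrates on the nearest training point.
    x_t = 0.1 * x0 + 0.9 * data[1]
    print(closed_form_velocity(x_t, 0.9, data))
```

At $t = 0$ the logits do not depend on $j$, so the field reduces to the data mean minus $x$. As $t$ approaches 1 the factor $(1 - t)^{-2}$ sharpens the softmax towards a single training point, which is the regime where the closed-form and conditional targets coincide.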

Figures (4)

  • Figure 1: We challenge the hypothesis that target stochasticity plays a major role in flow matching generalization. One panel shows histograms of the cosine similarities between $\hat{u}^\star((1-t) x_0 + t x_1, t)$ and $u^{\mathrm{cond}}((1-t) x_0 + t x_1, z=x_1, t) = x_1 - x_0$ for various time values $t$ and two datasets. For real, high-dimensional data, non-stochasticity arises very early (before $t = 0.2$ for CIFAR-10 with dimension $(3,32,32)$). The other panel displays the alignment between $\hat{u}^\star$ and $u^{\mathrm{cond}}$ over time for varying image dimensions $d$ on Imagenette.
  • Figure 2: Failure to learn the optimal velocity field, CIFAR-10. Left: The leftmost figure represents the average error between the optimal empirical velocity field $\hat{u}^\star$ and the learned velocity $u_{\theta}$ for multiple values of time $t$. Middle: The middle figure displays the FID-10k computed on the test dataset, using the DINOv2 embedding. Right: The rightmost figure displays the average distance between the generated samples and their closest image from the training set -- for reference, the horizontal dashed line indicates the mean distance between an image of CIFAR-10 train and its nearest neighbor in the dataset. All the quantities are computed/learned on a varying number of training samples ($10$ to $10^4$) of the CIFAR-10 dataset.
  • Figure 3: Generalization occurs at small times on CIFAR-10 (left) and CelebA $64$ (right). Top: Generalization (distance between generated samples and training data) of hybrid models that follow $\hat{u}^\star$ on $[0, \tau]$, then $u_\theta$ on $[\tau, 1]$. The four colored curves correspond to four specific $x_0$, the black dashed curve is the mean distance over the 256 generated images. Bottom: visualization of generated images for the four different starting noises and various values of $\tau$ (the background color matching the curve in the top figure). Following $\hat{u}^\star$ until $\tau \geq 0.3$ yields a model that is not able to generalize.
  • Figure 4: FID computed on the training set ($50$k) and the test set ($10$k) using multiple embeddings, Inception and DINOv2. Regressing against a more deterministic target (EFM - $128$, $256$, $1000$) does not yield performance decreases. On the contrary, the more deterministic the target, the better the performance.
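
The alignment measured in Figure 1 can be sketched on synthetic data. The snippet below is only an illustration under stated assumptions: it uses i.i.d. Gaussian points as a stand-in for images, a standard Gaussian source $x_0$, and the softmax-weighted closed form for $\hat{u}^\star$. The helper names `closed_form_velocity` and `mean_cosine_alignment` are hypothetical, and nothing here reproduces the paper's numbers.

```python
import numpy as np


def closed_form_velocity(x, t, data):
    """Softmax-weighted closed-form velocity (requires 0 <= t < 1)."""
    sq_dist = np.sum((x - t * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    return (weights @ (data - x)) / (1.0 - t)


def mean_cosine_alignment(data, t, n_pairs=64, seed=0):
    """Average cosine similarity between the closed-form velocity and the
    conditional target x1 - x0, evaluated at x_t = (1 - t) x0 + t x1."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    cosines = []
    for _ in range(n_pairs):
        x0 = rng.standard_normal(d)      # source sample (assumed Gaussian)
        x1 = data[rng.integers(n)]       # random training point
        x_t = (1.0 - t) * x0 + t * x1
        u_hat = closed_form_velocity(x_t, t, data)
        u_cond = x1 - x0
        cosines.append(
            u_hat @ u_cond / (np.linalg.norm(u_hat) * np.linalg.norm(u_cond))
        )
    return float(np.mean(cosines))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (4, 64, 1024):
        data = rng.standard_normal((200, d))  # synthetic stand-in for a dataset
        row = [round(mean_cosine_alignment(data, t), 3) for t in (0.0, 0.2, 0.5)]
        print(f"d={d}: alignment at t=0, 0.2, 0.5 -> {row}")
```

A cosine close to 1 means that the closed-form target and the stochastic target $x_1 - x_0$ point in nearly the same direction, so the target carries almost no remaining stochasticity at that time. Sweeping `d` gives a rough way to see how the dimension affects the time at which this happens on toy data.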

Theorems & Definitions (7)

  • Proposition 1: Closed-form Formula of the Optimal Velocity
  • Proposition 1
  • proof
  • Proposition 1
  • proof: Proof of Item \ref{app_prop_minimizer}.
  • proof: Proof of Item \ref{app_prop_unbiased}.
  • proof: Proof of Item \ref{app_prop_variance}.