How Do Flow Matching Models Memorize and Generalize in Sample Data Subspaces?

Weiguo Gao, Ming Li

TL;DR

This work analyzes how Flow Matching models memorize and generalize when the real data distribution is treated as discrete, with a finite set of samples spanning a low-dimensional subspace of a high-dimensional space. It derives an explicit optimal velocity field under a Gaussian prior and shows that generation paths memorize real data points: the field is a softmax-weighted blend of directions toward the data points, so generated samples represent the sample data subspace exactly. To handle suboptimal settings, it introduces OSDNet, which decomposes the velocity field into subspace and off-subspace components, and proves that the off-subspace component decays while the subspace component generalizes within the sample data subspace. A teacher-student training framework decouples these terms and yields bounds linking generation accuracy to training loss. Together, the results explain how Flow Matching can keep generated samples close to the data subspace while preserving diversity within it, which can inform the design of generative models for high-dimensional data with intrinsic low-dimensional structure.

Abstract

Real-world data is often assumed to lie within a low-dimensional structure embedded in high-dimensional space. In practical settings, we observe only a finite set of samples, forming what we refer to as the sample data subspace. It serves as an essential approximation, supporting tasks such as dimensionality reduction and generation. A major challenge lies in whether generative models can reliably synthesize samples that stay within this subspace rather than drifting away from the underlying structure. In this work, we provide theoretical insights into this challenge by leveraging Flow Matching models, which transform a simple prior into a complex target distribution via a learned velocity field. By treating the real data distribution as discrete, we derive analytical expressions for the optimal velocity field under a Gaussian prior, showing that generated samples memorize real data points and represent the sample data subspace exactly. To generalize to suboptimal scenarios, we introduce the Orthogonal Subspace Decomposition Network (OSDNet), which systematically decomposes the velocity field into subspace and off-subspace components. Our analysis shows that the off-subspace component decays, while the subspace component generalizes within the sample data subspace, ensuring generated samples preserve both proximity and diversity.
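The memorization claim can be illustrated with a minimal numerical sketch. It is not the authors' code, and it rests on assumptions not stated in this listing: the linear interpolation path $x_t = (1-t)x_0 + t x_1$, a standard Gaussian prior $x_0 \sim \mathcal{N}(0, I)$, and a uniform discrete target over the data points. Under those assumptions the minimizer of the CFM loss has a well-known closed form, a softmax-weighted mean of the data minus the current point, divided by $1-t$. The function names `optimal_velocity` and `generate`, and the toy data set, are hypothetical.

```python
import numpy as np

def optimal_velocity(x, t, data):
    """Closed-form velocity under the ASSUMED setup:
    x_t = (1 - t) * x0 + t * x1, x0 ~ N(0, I), x1 uniform over rows of `data`.
    x: (B, d) current points, data: (N, d) real data points, 0 <= t < 1."""
    diff = x[:, None, :] - t * data[None, :, :]                    # (B, N, d)
    logits = -np.sum(diff ** 2, axis=-1) / (2.0 * (1.0 - t) ** 2)  # (B, N)
    logits -= logits.max(axis=1, keepdims=True)                    # stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    target = w @ data                       # softmax-weighted blend of data points
    return (target - x) / (1.0 - t)

def generate(data, num_samples=16, num_steps=200, seed=0):
    """Forward-Euler integration of the ODE from t = 0 to t = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_samples, data.shape[1]))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * optimal_velocity(x, k * dt, data)
    return x

# Hypothetical toy data: six well-separated points in a 2-D subspace of R^5.
rng = np.random.default_rng(1)
basis, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal columns
coords = np.array([[-8.0, -6.0], [-3.0, 7.0], [0.0, -9.0],
                   [4.0, 2.0], [9.0, 8.0], [7.0, -5.0]])
data = coords @ basis.T                                 # (6, 5)

samples = generate(data)
# Distance from each generated sample to its nearest real data point.
nearest = np.linalg.norm(samples[:, None, :] - data[None, :, :], axis=-1).min(axis=1)
# Norm of the component of each sample outside the data subspace.
off_subspace = np.linalg.norm(samples - samples @ basis @ basis.T, axis=1)
print("max distance to nearest data point:", nearest.max())
print("max off-subspace norm:", off_subspace.max())
```

In this sketch each generated sample ends at one of the six data points and has essentially no component outside their span, which is the sense in which the optimal field memorizes the data and reproduces the sample data subspace.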

Paper Structure

This paper contains 44 sections, 20 theorems, 210 equations, 9 figures.

Key Result

Proposition 1

Assume that the function class $\{{\bm{v}}_t({\bm{x}};{\bm{\theta}})\}$ has enough capacity. Then the optimal velocity field ${\bm{v}}_t^*({\bm{x}})$ which minimizes the Conditional Flow Matching (CFM) loss is given by

Figures (9)

  • Figure 1: A conceptual illustration of Theorem 4.2, which shows the regions where the softmax weights in the optimal velocity field are not close to a one-hot vector. The black triangles represent the sparse and well-separated real data points. The gray areas indicate regions where the largest entry of the softmax weight is less than $0.99$. The darkness of the green areas reflects the density of $p_t$. Initially, the probability that $x \sim p_t$ falls into the gray region is high (at $t=0$), but it decreases rapidly over time. This demonstrates that the softmax weights quickly focus on a single data point as the generation process evolves. The core of Theorem 4.2 lies in estimating this probability.
  • Figure 2: A conceptual illustration of Theorem 4.3, which depicts the hierarchy emergence. The gray disk represents the set $S$ where $p_0$ is supported, while the colored regions at $t = 1$ indicate the support of $p_1$. At intermediate stages, the irregularly shaped regions correspond to $(1-t)S + tC_i$. Initially, at $t = 0$, all regions coincide with $S$, forming a mixed area. As time progresses, the regions start to separate, and once a generation path enters a specific region, it remains confined to it. The main result of Theorem 4.3 is the proof of a separation time, strictly before $t = 1$, when the regions $(1-t)S + tC_i$ become fully distinct.
  • Figure 3: An illustration of the Orthogonal Subspace Decomposition Network (OSDNet) in a 2-dimensional space, where time $t$ progresses from left to right with an interval of $\Delta t$. The velocity vector $\hat{{\bm{v}}}_t({\bm{x}})$ (shown by the red arrow pointing from ${\bm{x}}$ to ${\bm{x}} + \hat{{\bm{v}}}_t({\bm{x}}) \cdot \Delta t$) is decomposed into two components: one within the subspace ${\bm{V}}$ (cyan) and the other in the orthogonal complement ${\bm{V}}^\perp$ (green). These components, ${\bm{V}} \cdot \hat{{\bm{s}}}_t({\bm{x}})$ and ${\bm{V}}^\perp \cdot \hat{{\bm{O}}}_t \cdot ({\bm{V}}^\perp)^\top \cdot {\bm{x}}$, are computed by OSDNet, represented within the white background. The trapezoidal shapes in the network indicate the input and output dimensions, while the symbol $\otimes$ represents matrix multiplication. The dotted curves illustrate the trajectory ${\bm{\phi}}_t({\bm{x}})$ before time $t$ and after time $t + \Delta t$.
  • Figure 4: The generation paths of Flow Matching models applied to sparse and well-separated datasets under the optimal velocity field. The black triangles represent the real data points, which are $N = 6$ points randomly sampled from the region $[-10, 10] \times [-10, 10]$. The green crosses denote noise samples. The colored trajectories depict generation paths starting from various noise samples, with brighter colors indicating larger $t$. As $t$ increases, these paths straighten, converging towards specific real data points, validating Theorem 4.2.
  • Figure 5: The generation paths of Flow Matching models applied to hierarchical datasets (with one hierarchy) under the optimal velocity field. The black triangles represent real data points, which are obtained by adding Gaussian noise with a standard deviation of $0.5$ to the four cluster centers $(-2, 2)$, $(-2, -2)$, $(2, 2)$, and $(2, -2)$. In the first subfigure ($t=0$), the green crosses show the initial noise samples, and the colored trajectories depict generation paths with brighter colors indicating larger $t$. These paths are more curved than those in Figure 4. In the second, third, and fourth subfigures, the markers represent the intermediate points ${\bm{\phi}}_t({\bm{x}})$: cyan squares for $t=0.25$, orange diamonds for $t=0.5$, and red pentagons for $t=0.75$. The regions covered by these intermediate points overlap for $t=0, 0.25,$ and $0.5$, but are disjoint at $t=0.75$, providing visual validation for Theorem 4.3.
  • ...and 4 more figures
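The decomposition described in the Figure 3 caption, ${\bm{V}} \cdot \hat{{\bm{s}}}_t({\bm{x}}) + {\bm{V}}^\perp \cdot \hat{{\bm{O}}}_t \cdot ({\bm{V}}^\perp)^\top \cdot {\bm{x}}$, can be sketched numerically. The sketch makes the same assumptions as the standard closed form: a linear interpolation path, a standard Gaussian prior, and a uniform discrete target whose points lie in the span of ${\bm{V}}$. Under those assumptions the closed-form velocity splits into a subspace term that depends on ${\bm{x}}$ only through ${\bm{V}}^\top {\bm{x}}$, and an off-subspace term with $\hat{{\bm{O}}}_t = -\frac{1}{1-t} I$. That choice of $\hat{{\bm{O}}}_t$ is derived here from the assumed closed form and is not quoted from the paper. All function names are hypothetical.

```python
import numpy as np

def softmax_weights(z, t, coords):
    """Posterior weights over data points, using in-subspace coordinates only."""
    diff = z[:, None, :] - t * coords[None, :, :]
    logits = -np.sum(diff ** 2, axis=-1) / (2.0 * (1.0 - t) ** 2)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def osd_velocity(x, t, V, V_perp, coords):
    """Velocity in the form V s_t(x) + V_perp O_t V_perp^T x (rows are points)."""
    z = x @ V                                   # in-subspace coordinates, (B, k)
    r = x @ V_perp                              # off-subspace coordinates, (B, d-k)
    s = (softmax_weights(z, t, coords) @ coords - z) / (1.0 - t)
    O = -np.eye(V_perp.shape[1]) / (1.0 - t)    # ASSUMED form of O_t
    return s @ V.T + (r @ O.T) @ V_perp.T

def direct_velocity(x, t, data):
    """Same closed form evaluated directly in the ambient space."""
    diff = x[:, None, :] - t * data[None, :, :]
    logits = -np.sum(diff ** 2, axis=-1) / (2.0 * (1.0 - t) ** 2)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return (w @ data - x) / (1.0 - t)

# Hypothetical setup: a 2-D data subspace inside R^5.
d, k = 5, 2
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, V_perp = Q[:, :k], Q[:, k:]                  # orthonormal bases of V and V^perp
coords = np.array([[-8.0, -6.0], [-3.0, 7.0], [4.0, 2.0], [9.0, 8.0]])
data = coords @ V.T

x = rng.standard_normal((8, d))
v_osd = osd_velocity(x, 0.3, V, V_perp, coords)
v_direct = direct_velocity(x, 0.3, data)

# Integrate to t = 0.5 and compare off-subspace coordinates with their start.
num_steps = 100
xt = x.copy()
for step in range(num_steps // 2):
    xt = xt + (1.0 / num_steps) * osd_velocity(xt, step / num_steps, V, V_perp, coords)
r0, r_half = x @ V_perp, xt @ V_perp
print("max |v_osd - v_direct|:", np.abs(v_osd - v_direct).max())
print("off-subspace norm ratio at t = 0.5:", np.linalg.norm(r_half) / np.linalg.norm(r0))
```

With this assumed $\hat{{\bm{O}}}_t$, the off-subspace coordinates obey $\dot{r} = -r/(1-t)$, so $r(t) = (1-t)\,r(0)$ and the component outside the subspace shrinks linearly to zero at $t = 1$.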

Theorems & Definitions (21)

  • Proposition 1: Optimal velocity field for conditional flow matching
  • Theorem 4.1: Optimal velocity field for discrete target distribution
  • Corollary 1: Marginal probability path induced by the optimal velocity field (Lipman et al., 2023)
  • Theorem 4.2: Probability bound for softmax weight concentration
  • Theorem 4.3: Hierarchy emergence of generation paths
  • Theorem 4.4: Memorization phenomenon under the optimal velocity field
  • Definition 1: Orthogonal Subspace Decomposition Network (OSDNet)
  • Proposition 2: Optimal velocity field as a specific instance of OSDNet
  • Proposition 3: Teacher-student training of OSDNet
  • Proposition 4: Teacher-student training of $\hat{{\bm{O}}}_t$
  • ...and 11 more
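The quantity behind the softmax weight concentration result, described in the Figure 1 caption as the probability that the largest softmax weight is below $0.99$ for $x \sim p_t$, can be estimated by Monte Carlo. The sketch below is only an empirical illustration, not the theorem's bound. It assumes a linear interpolation path, a standard Gaussian prior, and a uniform target over a hypothetical set of six well-separated 2-D points; the function name `fraction_not_one_hot` is also hypothetical.

```python
import numpy as np

def fraction_not_one_hot(t, data, num_samples=20000, threshold=0.99, seed=0):
    """Estimate P(max softmax weight < threshold) for x ~ p_t under the ASSUMED
    setup x_t = (1 - t) * x0 + t * x1, x0 ~ N(0, I), x1 uniform over `data`."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal((num_samples, data.shape[1]))
    x1 = data[rng.integers(0, data.shape[0], size=num_samples)]
    x = (1.0 - t) * x0 + t * x1
    diff = x[:, None, :] - t * data[None, :, :]
    logits = -np.sum(diff ** 2, axis=-1) / (2.0 * (1.0 - t) ** 2)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return float(np.mean(w.max(axis=1) < threshold))

# Hypothetical sparse, well-separated data in [-10, 10] x [-10, 10].
data = np.array([[-8.0, -6.0], [-3.0, 7.0], [0.0, -9.0],
                 [4.0, 2.0], [9.0, 8.0], [7.0, -5.0]])
times = [0.0, 0.25, 0.5, 0.75, 0.9]
fractions = [fraction_not_one_hot(t, data) for t in times]
for t, f in zip(times, fractions):
    print(f"t = {t:.2f}: fraction with max weight < 0.99 = {f:.4f}")
```

At $t = 0$ the weights are uniform, so every sample falls in the "gray region"; as $t$ grows the estimated fraction drops, matching the qualitative picture in the Figure 1 caption.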