How Do Flow Matching Models Memorize and Generalize in Sample Data Subspaces?
Weiguo Gao, Ming Li
TL;DR
This work analyzes how Flow Matching models memorize and generalize when the real data distribution is treated as discrete and supported on a low-dimensional subspace of a high-dimensional space. It derives an explicit optimal velocity field under a Gaussian prior and shows that generation paths are driven toward a softmax-weighted blend of the real data points, so generated samples memorize those points and represent the sample subspace exactly. To handle suboptimal settings, it introduces OSDNet, which decomposes the velocity field into subspace and off-subspace components, and proves that the off-subspace component decays while the subspace component generalizes within the data subspace. A teacher-student training framework decouples these terms and yields bounds linking generation accuracy to training loss. Together, the results clarify how Flow Matching can keep generated samples close to the data subspace while preserving diversity within it, guiding the principled design of models with robust subspace generalization and memorization characteristics. These insights have practical implications for efficient generation and dimensionality-aware modeling of high-dimensional data with intrinsic low-dimensional structure.
Abstract
Real-world data is often assumed to lie within a low-dimensional structure embedded in high-dimensional space. In practical settings, we observe only a finite set of samples, forming what we refer to as the sample data subspace. It serves as an essential approximation supporting tasks such as dimensionality reduction and generation. A major challenge lies in whether generative models can reliably synthesize samples that stay within this subspace rather than drifting away from the underlying structure. In this work, we provide theoretical insights into this challenge by leveraging Flow Matching models, which transform a simple prior into a complex target distribution via a learned velocity field. By treating the real data distribution as discrete, we derive analytical expressions for the optimal velocity field under a Gaussian prior, showing that generated samples memorize real data points and represent the sample data subspace exactly. To generalize to suboptimal scenarios, we introduce the Orthogonal Subspace Decomposition Network (OSDNet), which systematically decomposes the velocity field into subspace and off-subspace components. Our analysis shows that the off-subspace component decays, while the subspace component generalizes within the sample data subspace, ensuring generated samples preserve both proximity and diversity.
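To make the memorization claim concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the commonly used linear interpolation path x_t = (1 - t) x_0 + t x_1 with x_0 drawn from a standard Gaussian and x_1 uniform over the data points; under that assumption the marginal velocity is (sum_i w_i y_i - x) / (1 - t), where the weights w_i are a softmax over the negative scaled squared distances between x and t y_i. The paper's exact parameterization may differ. The function names, the toy data, and the use of the linear span of the data as a stand-in for the sample data subspace are all hypothetical choices made for illustration.

```python
import numpy as np


def optimal_velocity(x, t, data):
    """Closed-form marginal velocity for x_t = (1 - t) * x0 + t * x1,
    assuming x0 ~ N(0, I) and x1 uniform over the rows of `data`.

    x: (batch, dim), t: scalar in [0, 1), data: (n, dim).
    """
    # p_t(x | y_i) = N(t * y_i, (1 - t)^2 I), so the posterior over data
    # points is a softmax of negative scaled squared distances.
    diffs = x[:, None, :] - t * data[None, :, :]           # (batch, n, dim)
    logits = -np.sum(diffs**2, axis=-1) / (2.0 * (1.0 - t) ** 2)
    logits -= logits.max(axis=1, keepdims=True)            # stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # The conditional velocity toward y_i is (y_i - x) / (1 - t);
    # averaging it under the posterior gives the marginal velocity.
    target = weights @ data                                # (batch, dim)
    return (target - x) / (1.0 - t)


def generate(data, num_samples=64, num_steps=200, seed=0):
    """Euler-integrate dx/dt = v(x, t) from t = 0 with uniform steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_samples, data.shape[1]))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * optimal_velocity(x, k * dt, data)
    return x


def off_subspace_part(x, data):
    """Component of each row of x orthogonal to span(rows of data)."""
    # Orthonormal basis of the span, taken from the SVD of the data matrix.
    _, s, vt = np.linalg.svd(data, full_matrices=False)
    basis = vt[s > 1e-10 * s.max()]                        # (rank, dim)
    return x - (x @ basis.T) @ basis


if __name__ == "__main__":
    # Hypothetical toy data: four points spanning a 2-D subspace of R^5.
    data = np.array([[1, 1, 0, 0, 0], [1, -1, 0, 0, 0],
                     [-1, 1, 0, 0, 0], [-1, -1, 0, 0, 0]], dtype=float)
    samples = generate(data)
    dists = np.linalg.norm(samples[:, None] - data[None], axis=-1)
    print("max distance to nearest data point:", dists.min(axis=1).max())
    off = np.linalg.norm(off_subspace_part(samples, data), axis=1)
    print("max off-subspace norm:", off.max())
```

In this toy setting the off-subspace coordinates obey dz/dt = -z / (1 - t), so they shrink linearly to zero, while the in-subspace coordinates are pulled toward whichever data point dominates the softmax. Both printed quantities should therefore be close to zero, which matches the memorization behavior described above for the optimal field.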
