Carré du champ flow matching: better quality-generalisation tradeoff in generative models

Jacob Bamberger, Iolo Jones, Dennis Duncan, Michael M. Bronstein, Pierre Vandergheynst, Adam Gosztolai

TL;DR

This work tackles the quality-generalisation tradeoff in flow-based generative models by introducing Carré du champ Flow Matching (CDC-FM), a geometry-aware generalisation of Flow Matching (FM). CDC-FM replaces FM’s homogeneous diffusion with an anisotropic, data-driven noise guided by a locally estimated carré du champ Γ̂, derived from diffusion-geometry techniques, yielding probability paths that align with the data manifold via displacement interpolants. The authors provide theoretical justification that the approach corresponds to optimal transport and anisotropic diffusion on the data geometry, and they develop a scalable practical estimator for Γ̂. Empirically, CDC-FM achieves comparable or better sample quality while substantially reducing memorisation across geometric datasets (LiDAR, single-cell trajectories, motion capture) and standard architectures (MLPs, CNNs, transformers), with notable gains in data-scarce or heterogeneously sampled regimes. The method serves as a plug-in regulariser that strengthens generalisation without sacrificing fidelity, offering a principled path toward geometry-aware flow-based generative modelling at scale.

Abstract

Deep generative models often face a fundamental tradeoff: high sample quality can come at the cost of memorisation, where the model reproduces training data rather than generalising across the underlying data geometry. We introduce Carré du champ flow matching (CDC-FM), a generalisation of flow matching (FM) that improves the quality-generalisation tradeoff by regularising the probability path with geometry-aware noise. Our method replaces the homogeneous, isotropic noise in FM with spatially varying, anisotropic Gaussian noise whose covariance captures the local geometry of the latent data manifold. We prove that this geometric noise can be optimally estimated from the data and that it scales to large datasets. Further, we provide an extensive experimental evaluation on diverse datasets (synthetic manifolds, point clouds, single-cell genomics, animal motion capture, and images) and on various neural network architectures (MLPs, CNNs, and transformers). We demonstrate that CDC-FM consistently offers a better quality-generalisation tradeoff. We observe significant improvements over standard FM in data-scarce regimes and in highly non-uniformly sampled datasets, which are often encountered in AI-for-science applications. Our work provides a mathematical framework for studying the interplay between data geometry, generalisation and memorisation in generative models, as well as a robust and scalable algorithm that can be readily integrated into existing flow matching pipelines.
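
To make "spatially varying, anisotropic Gaussian noise" concrete, the following is a minimal sketch, not the authors' estimator. It swaps in a plain k-nearest-neighbour local covariance (local PCA) as a stand-in for the carré du champ estimate $\widehat{\mathbf{\Gamma}}$, which is not reproduced here. The function names `local_covariances` and `sample_anisotropic_noise`, the neighbourhood size `k`, and the `jitter` term are all hypothetical choices made for illustration.

```python
import numpy as np

def local_covariances(data, k=8):
    """Per-point covariance of the k nearest neighbours.

    Assumption: a local-PCA stand-in for the paper's estimator, not the
    carré du champ construction itself.
    """
    # Pairwise squared Euclidean distances, shape (n, n).
    sq = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)
    # Indices of the k nearest neighbours; column 0 is the point itself.
    idx = np.argsort(sq, axis=1)[:, 1:k + 1]
    nbrs = data[idx]                                   # (n, k, d)
    centred = nbrs - nbrs.mean(axis=1, keepdims=True)
    return np.einsum("nki,nkj->nij", centred, centred) / k

def sample_anisotropic_noise(cov, rng, jitter=1e-9):
    """Draw one sample from N(0, cov_i) for every point i."""
    n, d, _ = cov.shape
    # Small diagonal jitter keeps the Cholesky factorisation well defined.
    chol = np.linalg.cholesky(cov + jitter * np.eye(d))
    z = rng.standard_normal((n, d))
    return np.einsum("nij,nj->ni", chol, z)

# Toy data: 200 points on the unit circle.
rng = np.random.default_rng(0)
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)

cov = local_covariances(circle, k=8)
noise = sample_anisotropic_noise(cov, rng)
noisy_points = circle + noise
```

On this toy circle the estimated covariance at each point is dominated by the tangent direction, so the added noise spreads samples along the curve rather than off it, in contrast to isotropic noise of a fixed scale.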

Paper Structure

This paper contains 43 sections, 4 theorems, 43 equations, 11 figures, 9 tables, 2 algorithms.

Key Result

Proposition 1

Given that $p_0(x | x_0, x_1)=\mathcal{N}(x_0, \widehat{\mathbf{\Gamma}}(x_0))$ and $p_1(x|x_0, x_1)=\mathcal{N}(x_1, \widehat{\mathbf{\Gamma}}(x_1))$ are Gaussian, the displacement interpolant between them, $p_t(x | x_0, x_1) := [p_0,\,p_1]_t$, is also Gaussian, with mean and covariance given by where $\mathbf{A}_t = \left(1-t\right)\mathbf{I} + t\mathbf{B}$ and $\mathbf{B} := \widehat{\mathbf{\
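
The statement above is cut off in this listing, so the sketch below falls back on the standard closed form for the displacement (Wasserstein-2) interpolant between two Gaussians rather than on the paper's exact expressions. It assumes the mean is $(1-t)x_0 + t x_1$, the covariance is $\mathbf{A}_t \widehat{\mathbf{\Gamma}}(x_0) \mathbf{A}_t$, and $\mathbf{B}$ is the optimal transport map between the centred Gaussians, which satisfies $\mathbf{B}\,\widehat{\mathbf{\Gamma}}(x_0)\,\mathbf{B} = \widehat{\mathbf{\Gamma}}(x_1)$. It also assumes $\widehat{\mathbf{\Gamma}}(x_0)$ is strictly positive definite, which may not hold for low-rank estimates. The names `spd_sqrt` and `displacement_interpolant` are hypothetical.

```python
import numpy as np

def spd_sqrt(m):
    """Square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    # Clip tiny negative eigenvalues caused by round-off.
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def displacement_interpolant(x0, x1, gamma0, gamma1, t):
    """Mean and covariance of the W2 geodesic between
    N(x0, gamma0) and N(x1, gamma1) at time t in [0, 1].

    Assumption: gamma0 is strictly positive definite.
    """
    s0 = spd_sqrt(gamma0)
    s0_inv = np.linalg.inv(s0)
    # Standard OT map between centred Gaussians: b @ gamma0 @ b == gamma1.
    b = s0_inv @ spd_sqrt(s0 @ gamma1 @ s0) @ s0_inv
    a_t = (1.0 - t) * np.eye(len(x0)) + t * b   # A_t = (1 - t) I + t B
    mean_t = (1.0 - t) * x0 + t * x1
    cov_t = a_t @ gamma0 @ a_t                  # a_t is symmetric
    return mean_t, cov_t

# Two non-commuting covariances as a small check.
g0 = np.array([[1.0, 0.2], [0.2, 0.5]])
g1 = np.array([[0.3, -0.1], [-0.1, 2.0]])
x0, x1 = np.zeros(2), np.ones(2)
mean_half, cov_half = displacement_interpolant(x0, x1, g0, g1, 0.5)
```

At `t = 0` and `t = 1` the function recovers the two endpoint Gaussians, which is a quick way to check any implementation of this interpolant.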

Figures (11)

  • Figure 1: Carré du champ flow matching. (a) FM conditional path design is oblivious to the manifold structure, which can result in off-manifold samples, shown by the black arrows ($\sigma_{\min} >0$). (b) Conditional velocity fields (blue arrows) in FM transport mass to training points. (c) Generated density by FM trained on eight samples of a unit circle ($\sigma_{\min}=0$). FM memorises, concentrating likelihood around training points. (d) CDC-FM conditional probability paths are the displacement (optimal transport) interpolants between local covariances and are thus aligned with the geometry. (e) CDC-FM conditional velocity fields flow perpendicular to the manifold. (f) CDC-FM regularises along the manifold, mitigating memorisation and facilitating generalisation.
  • Figure 2: Visual comparison of FM vs CDC-FM for LiDAR data.
  • Figure 3: Early stopping for spatially heterogeneous data. (a) Samples from FM and CDC-FM trained on the two-circles dataset at an epoch late in the training (40k), when FM captures the small circle and memorises samples on the larger one. (b) Quality, (c) generalisation, and (d) memorisation against training epoch for the two methods, presented separately for the two circles. Lines represent means over samples.
  • Figure 4: Quality-generalisation tradeoff for animal motion capture data. (a) Inset: an example fruit fly pose sequence. Point cloud, each point representing a 31-frame pose sequence, visualised in 3D UMAP coordinates. Shading indicates walking speed. (b) Generalisation against quality for CDC-FM for different $\widehat{\mathbf{\Gamma}}$ ranks, $d_{cdc}$, and for FM for different $\sigma_{\min}$. Black circles indicate epochs analysed in (e). (c) Same as (b), but for memorisation. (d) Generalisation against epochs. (e) Percentage of memorised samples nearest to a training point (FM: $\sigma_{\min}\!=\!0$, CDC-FM: $d_{cdc}=16$). (f) Variation of train data sparsity. (g) Average memorisation against sparsity for different epochs.
  • Figure 5: Synthetic experiment on toroidal manifold. Effect of data dimension on (a) sample quality, (b) memorisation and (c) generalisation.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Proposition 1
  • proof
  • Theorem 1
  • proof
  • Proposition 2
  • proof
  • Theorem 2
  • proof