Flow Matching with Semidiscrete Couplings

Alireza Mousavi-Hosseini, Stephen Y. Zhang, Michal Klein, Marco Cuturi

TL;DR

This paper addresses the computational bottleneck of OT-guided flow matching for training flow-based generative models by replacing batch OT with semidiscrete optimal transport (SD-OT). The resulting method, semidiscrete flow matching (SD-FM), learns a dual potential over the discrete data support and uses a fast lookup based on maximum inner product search (MIPS) to assign freshly sampled noise to data points during training, avoiding the quadratic cost of batch OT while preserving the benefits of OT-inspired pairings. The method is supported by theoretical guarantees for SGD convergence under the semidiscrete formulation and by a practical convergence criterion based on a chi-squared divergence estimator. Empirically, SD-FM improves training and inference efficiency, FID, and sample quality across unconditional and conditional generation, image super-resolution, and guidance scenarios, often with orders of magnitude less computation than OT-FM and with greater robustness than independent flow matching (I-FM).

Abstract

Flow models parameterized as time-dependent velocity fields can generate data from noise by integrating an ODE. These models are often trained using flow matching, i.e. by sampling random pairs of noise and target points $(\mathbf{x}_0,\mathbf{x}_1)$ and ensuring that the velocity field is aligned, on average, with $\mathbf{x}_1-\mathbf{x}_0$ when evaluated along a segment linking $\mathbf{x}_0$ to $\mathbf{x}_1$. While these pairs are sampled independently by default, they can also be selected more carefully by matching batches of $n$ noise to $n$ target points using an optimal transport (OT) solver. Although promising in theory, the OT flow matching (OT-FM) approach is not widely used in practice. Zhang et al. (2025) pointed out recently that OT-FM truly starts paying off when the batch size $n$ grows significantly, which only a multi-GPU implementation of the Sinkhorn algorithm can handle. Unfortunately, the costs of running Sinkhorn can quickly balloon, requiring $O(n^2/\varepsilon^2)$ operations for every $n$ pairs used to fit the velocity field, where $\varepsilon$ is a regularization parameter that should be typically small to yield better results. To fulfill the theoretical promises of OT-FM, we propose to move away from batch-OT and rely instead on a semidiscrete formulation that leverages the fact that the target dataset distribution is usually of finite size $N$. The SD-OT problem is solved by estimating a dual potential vector using SGD; using that vector, freshly sampled noise vectors at train time can then be matched with data points at the cost of a maximum inner product search (MIPS). Semidiscrete FM (SD-FM) removes the quadratic dependency on $n/\varepsilon$ that bottlenecks OT-FM. SD-FM beats both FM and OT-FM on all training metrics and inference budget constraints, across multiple datasets, on unconditional/conditional generation, or when using mean-flow models.
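
To make the two-phase recipe in the abstract concrete, here is a minimal sketch, assuming no entropic regularization ($\varepsilon = 0$), the negative dot-product cost $c(\mathbf{x},\mathbf{y}) = -\langle \mathbf{x},\mathbf{y}\rangle$, standard Gaussian noise, and uniform weights $1/N$ on the data points. The function names `assign` and `fit_potential`, the brute-force search, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def assign(x, data, g):
    """Index j maximizing <x, y_j> + g_j.

    With eps = 0 and cost c(x, y) = -<x, y>, this is the data point whose
    Laguerre cell contains the noise sample x.
    """
    return max(range(len(data)), key=lambda j: dot(x, data[j]) + g[j])

def fit_potential(data, steps=500, batch=64, lr=0.2, seed=0):
    """Stochastic gradient ascent on the semidual objective (illustrative).

    The gradient estimate for coordinate j is the target mass 1/N minus the
    fraction of the noise batch currently assigned to y_j.
    """
    rng = random.Random(seed)
    n_points, dim = len(data), len(data[0])
    g = [0.0] * n_points
    for _ in range(steps):
        counts = [0] * n_points
        for _ in range(batch):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            counts[assign(x, data, g)] += 1
        for j in range(n_points):
            g[j] += lr * (1.0 / n_points - counts[j] / batch)
    return g
```

Once `g` has been fitted, each freshly sampled noise vector `x` is paired with `data[assign(x, data, g)]`. In this sketch the search is a linear scan over all $N$ points; the abstract's point is that this step has the cost of a single MIPS query, so a dedicated index could replace the scan.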

Paper Structure

This paper contains 39 sections, 9 theorems, 90 equations, 17 figures, 4 tables, 6 algorithms.

Key Result

Theorem 2

Suppose either $\varepsilon > 0$ or Assumption \ref{assump:reg} is satisfied. Let $L_\varepsilon \coloneqq 1/\varepsilon$ for $\varepsilon > 0$ and $L_0 \coloneqq C^{\mathrm{max}}_\mu/\delta$ otherwise. For any $K \in \mathbb{N}$, let $\eta_k = \sqrt{\Delta/(L_\varepsilon K)}$ be a constant learning rate, w

Figures (17)

  • Figure 1: I-FM (left) assigns noise to data purely at random. OT-FM (middle-left) samples batches of $n$ noise and $n$ data points and re-aligns them with an optimal matching permutation $\sigma^\star$. These matches are, however, inherently unstable, as $n$ points reflect neither the whole noise distribution nor the dataset. Drastically increasing $n$ can mitigate this issue (Zhang et al., 2025), but at a significant cost. Our method, SD-FM (right), solves these issues in two steps: in a precompute phase, the semidiscrete OT problem (parameterized as a vector of size $N$, the dataset size) is solved using SGD. At FM train time, each newly sampled noise is assigned to a data point using a maximum inner product search; the plot illustrates the resulting Laguerre cells (Mérigot, 2011). Our figure uses no entropic regularization ($\varepsilon=0$) and a neg-dot-product cost for simplicity; see equation \ref{eq:soft-min} for more generality.
  • Figure 2: SD-OT convergence: ${\widehat{\chi}}^2$-divergence vs. SGD optimization steps for ImageNet-32, averaged over 3 seeds. See \ref{sec:appendix_experiments} for details.
  • Figure 3: Better SD Potential Estimation = Better Curvature and FID. On ImageNet-32, convergence of the dual potential $\mathbf{g}$ vs. SD-FM ($\varepsilon=0$) curvature and FID; I-FM is shown as lines. Note that the curvatures of different solvers are computed on different trajectories and are therefore not comparable.
  • Figure 3: CelebA super-resolution results.
  • Figure 4: FID vs. time needed to form a pair when training I-FM, OT-FM (varying batch sizes $n$) and SD-FM. We use $\varepsilon = 0$ (SD only), $0.01, 0.1$. Couplings are computed using full $d$ or PCA space. Red lines show the per-sample time $\Theta$ needed to compute the gradient of the loss for one pair. SD-FM yields significant improvements for negligible overhead.
  • ...and 12 more figures
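
The pairing rules contrasted in Figure 1 can be sketched as follows, assuming $\varepsilon = 0$ and the neg-dot-product cost used in that figure. Appending the potential $g_j$ to each data point and a constant 1 to the noise vector turns the search for $\arg\max_j \langle \mathbf{x}_0, \mathbf{y}_j\rangle + g_j$ into a plain inner product search. All function names here are hypothetical, and `mips` is a brute-force stand-in for whatever search index is used in practice.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mips(query, keys):
    """Brute-force maximum inner product search (illustrative stand-in)."""
    return max(range(len(keys)), key=lambda j: dot(query, keys[j]))

def ifm_pair(data, rng):
    """I-FM style pairing: the data point is drawn independently of the noise."""
    return rng.choice(data)

def sd_pair(x0, data, g):
    """SD-FM style pairing at eps = 0: argmax_j <x0, y_j> + g_j via plain MIPS."""
    keys = [list(y) + [g_j] for y, g_j in zip(data, g)]  # augmented keys [y_j, g_j]
    j = mips(list(x0) + [1.0], keys)                     # augmented query [x0, 1]
    return data[j]

def fm_example(x0, x1, t):
    """Point on the segment from x0 to x1 and the regression target x1 - x0."""
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return xt, target
```

In both cases the velocity field is then regressed onto `target` at the point `xt`; only the rule that selects `x1` for a given `x0` differs.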

Theorems & Definitions (18)

  • Theorem 2
  • Proposition 3: Generalized Tweedie's Formula
  • Proposition 4: Informal
  • Proof of \ref{eq:chisq}
  • Proof of \ref{prop:score_formula}
  • Proposition 5
  • Lemma 6
  • Proof
  • Proof of \ref{prop:cfg_formal}
  • Lemma 7
  • ...and 8 more