Optimal and Diffusion Transports in Machine Learning

Gabriel Peyré

TL;DR

This work unifies diffusion-based generative modeling and optimal transport as evolutions of probability measures, connecting practical algorithms with rigorous measure-theoretic PDE frameworks. It surveys flow matching and diffusion methods, Benamou–Brenier dynamic OT, and JKO/Wasserstein gradient flows, highlighting gradient structures, mean-field limits, and their relevance to neural networks and transformers. Key contributions include explicit flow constructions (flow matching, conditional-expectation fields), Gaussian special cases with closed-form solutions, and universality results for depth in transformers, along with open questions on sample complexity, discretization, and deep architectures. The framework provides a single lens for analyzing sampling, training dynamics, and token evolution, with implications for designing scalable, stable, and interpretable ML systems.

Abstract

Several problems in machine learning are naturally expressed as the design and analysis of time-evolving probability distributions. This includes sampling via diffusion methods, optimizing the weights of neural networks, and analyzing the evolution of token distributions across layers of large language models. While the targeted applications differ (samples, weights, tokens), their mathematical descriptions share a common structure. A key idea is to switch from the Eulerian representation of densities to their Lagrangian counterpart through vector fields that advect particles. This dual view introduces challenges, notably the non-uniqueness of Lagrangian vector fields, but also opportunities to craft density evolutions and flows with favorable properties in terms of regularity, stability, and computational tractability. This survey presents an overview of these methods, with emphasis on two complementary approaches: diffusion methods, which rely on stochastic interpolation processes and underpin modern generative AI, and optimal transport, which defines interpolation by minimizing displacement cost. We illustrate how both approaches appear in applications ranging from sampling and neural network optimization to modeling the dynamics of transformers for large language models.

Paper Structure

This paper contains 29 sections, 11 theorems, 46 equations.

Key Result

Proposition 3.1

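Let $(X_0,X_1)\sim \alpha_0 \otimes \alpha_1$ and define, for $t\in(0,1)$, $X_t := (1-t)X_0 + tX_1$, $\alpha_t := \mathrm{Law}(X_t)$, and $v_t(x) := \mathbb{E}[X_1 - X_0 \mid X_t = x]$. Then the pair $(\alpha_t,v_t)$ satisfies the continuity equation $\partial_t \alpha_t + \mathrm{div}(\alpha_t v_t) = 0$ in the weak sense.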
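A small numerical check may help make this statement concrete. The sketch below is not taken from the paper; it assumes the standard flow-matching setup (linear interpolation $X_t = (1-t)X_0 + tX_1$ with field $v_t(x) = \mathbb{E}[X_1 - X_0 \mid X_t = x]$) and a one-dimensional Gaussian pair $\alpha_0 = \mathcal{N}(0,1)$, $\alpha_1 = \mathcal{N}(m, s^2)$, for which the conditional expectation follows from Gaussian conditioning. The names `velocity` and `integrate_flow` and the values of `m`, `s` are illustrative choices. Integrating $\dot{x} = v_t(x)$ with explicit Euler should carry a starting point $x_0$ close to $m + s\,x_0$, the monotone map from $\alpha_0$ to $\alpha_1$.

```python
def velocity(x, t, m, s):
    """v_t(x) = E[X_1 - X_0 | X_t = x] for independent X_0 ~ N(0,1), X_1 ~ N(m, s^2).

    Assumes the linear interpolation X_t = (1 - t) X_0 + t X_1.
    """
    var_t = (1.0 - t) ** 2 + (t * s) ** 2   # Var(X_t)
    cov = t * s ** 2 - (1.0 - t)            # Cov(X_1 - X_0, X_t)
    # Gaussian conditioning: E[U | V = x] = E[U] + Cov(U, V) / Var(V) * (x - E[V])
    return m + (cov / var_t) * (x - t * m)


def integrate_flow(x0, m, s, n_steps=1000):
    """Explicit Euler integration of dx/dt = v_t(x) from t = 0 to t = 1."""
    h = 1.0 / n_steps
    x = x0
    for k in range(n_steps):
        x += h * velocity(x, k * h, m, s)
    return x


# Illustrative target parameters (assumptions, not values from the paper).
m, s = 3.0, 2.0
starts = [-2.0, -1.0, 0.0, 1.0, 2.0]
ends = [integrate_flow(x0, m, s) for x0 in starts]

for x0, x1 in zip(starts, ends):
    print(f"x0 = {x0:+.1f}  ->  x1 = {x1:.4f}   (m + s*x0 = {m + s * x0:.4f})")
```

In this Gaussian case the flow is affine, so the agreement with $m + s\,x_0$ is limited only by the Euler step size. For general $\alpha_0$, $\alpha_1$ the conditional expectation has no closed form and is typically approximated by regression.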

Theorems & Definitions (11)

  • Proposition 3.1: Conditional expectation field, lipman2022flow, albergo2023stochastic
  • Proposition 3.2: hurault2025score
  • Proposition 3.3: Optimal vector field
  • Theorem 4.1: Brenier brenier1991polar
  • Theorem 4.2: Benamou–Brenier benamou2000computational
  • Proposition 4.3
  • Theorem 5.1: Jordan–Kinderlehrer–Otto jordan1998variational
  • Theorem 6.1: Chizat–Bach
  • Proposition 6.2: Wasserstein flow for linear networks closes on Gaussians
  • Theorem 7.1: Gaussian closure and covariance dynamics castin2025unified
  • ...and 1 more