Optimal and Diffusion Transports in Machine Learning
Gabriel Peyré
TL;DR
This survey treats diffusion-based generative modeling and optimal transport as evolutions of probability measures, connecting practical algorithms with rigorous measure-theoretic PDE frameworks. It covers flow matching and diffusion methods, Benamou–Brenier dynamic OT, and JKO/Wasserstein gradient flows, highlighting gradient structures, mean-field limits, and their relevance to neural networks and transformers. Key contributions include explicit flow constructions (flow matching, conditional-expectation fields), Gaussian special cases with closed-form solutions, and universality results for depth in transformers, along with open questions on sample complexity, discretization, and deep architectures. The framework provides a common lens for analyzing sampling, training dynamics, and token evolution, with implications for designing scalable, stable, and interpretable ML systems.
Abstract
Several problems in machine learning are naturally expressed as the design and analysis of time-evolving probability distributions. This includes sampling via diffusion methods, optimizing the weights of neural networks, and analyzing the evolution of token distributions across layers of large language models. While the targeted applications differ (samples, weights, tokens), their mathematical descriptions share a common structure. A key idea is to switch from the Eulerian representation of densities to their Lagrangian counterpart through vector fields that advect particles. This dual view introduces challenges, notably the non-uniqueness of Lagrangian vector fields, but also opportunities to craft density evolutions and flows with favorable properties in terms of regularity, stability, and computational tractability. This survey presents an overview of these methods, with emphasis on two complementary approaches: diffusion methods, which rely on stochastic interpolation processes and underpin modern generative AI, and optimal transport, which defines interpolation by minimizing displacement cost. We illustrate how both approaches appear in applications ranging from sampling and neural network optimization to modeling the dynamics of transformers for large language models.
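The switch from an Eulerian density to Lagrangian particles can be illustrated with a minimal sketch, assuming a one-dimensional Gaussian source N(m0, s0^2) and target N(m1, s1^2). This setting is chosen here only because the optimal transport map is known in closed form, T(x) = m1 + (s1/s0)(x - m0). The function names `velocity` and `advect` and all parameter values are hypothetical and used for illustration. The sketch advects samples along the velocity field of the displacement interpolation and compares the result with T.

```python
import numpy as np

def velocity(x, t, m0, s0, m1, s1):
    """Eulerian velocity field of the displacement interpolation
    between N(m0, s0^2) and N(m1, s1^2), evaluated at position x, time t."""
    m_t = (1 - t) * m0 + t * m1   # mean of the interpolated Gaussian
    s_t = (1 - t) * s0 + t * s1   # standard deviation of the interpolated Gaussian
    return (m1 - m0) + (s1 - s0) * (x - m_t) / s_t

def advect(x0, m0, s0, m1, s1, n_steps=100):
    """Lagrangian view: move particles along the field with explicit Euler steps."""
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * velocity(x, k * h, m0, s0, m1, s1)
    return x

# Illustrative parameters (assumptions, not taken from the survey).
rng = np.random.default_rng(0)
m0, s0, m1, s1 = 0.0, 1.0, 3.0, 0.5
x0 = m0 + s0 * rng.standard_normal(10_000)   # samples from the source

x1 = advect(x0, m0, s0, m1, s1)              # particles at time t = 1
monge = m1 + (s1 / s0) * (x0 - m0)           # closed-form transport map

print("max deviation from closed-form map:", np.max(np.abs(x1 - monge)))
```

In this particular case each particle travels along a straight line at constant speed, so the Euler steps introduce no discretization error beyond floating-point rounding. For general densities the field is not available in closed form and has to be estimated.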
