Recent Advances in Optimal Transport for Machine Learning

Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Antoine Souloumiac

TL;DR

Optimal transport (OT) provides a principled framework for comparing and transforming probability distributions, anchored by the Wasserstein distance $W_p$ and the Monge and Kantorovich formulations, the latter defined over transport plans in $\Gamma(P,Q)$. The paper surveys theory, computation, and ML applications from 2012–2023, covering entropy-regularized and unbalanced variants, Gromov-Wasserstein (GW) and fused Gromov-Wasserstein (FGW) distances, and neural OT solvers, as well as practical uses across supervised, unsupervised, transfer, and reinforcement learning. It highlights OT as both a loss function and a distribution-manipulation toolkit, with applications ranging from OT-based losses and fairness in supervised learning to generative modeling, dictionary learning, clustering, domain adaptation, and distributional RL. Major challenges include the curse of dimensionality and computational burden, while promising directions involve learned ground costs, sliced/generalized OT, and scalable neural OT architectures that integrate OT into end-to-end ML pipelines.
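
For reference, the two objects named in the summary are usually defined as follows. This is the standard textbook form, not copied from the paper; the underlying space $\mathcal{X}$, the ground metric $d$, and the projection maps $\pi_1, \pi_2$ are notation assumed here for illustration.

$$
\Gamma(P,Q) = \left\{ \gamma \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \;:\; (\pi_1)_{\sharp}\gamma = P,\ (\pi_2)_{\sharp}\gamma = Q \right\}
$$

$$
W_p(P,Q) = \left( \inf_{\gamma \in \Gamma(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} d(x,y)^{p} \, d\gamma(x,y) \right)^{1/p}, \qquad p \geq 1
$$

In words, $\Gamma(P,Q)$ collects all joint distributions whose marginals are $P$ and $Q$, and $W_p$ is the $p$-th root of the cheapest expected transport cost over that set.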

Abstract

Recently, Optimal Transport has been proposed as a probabilistic framework in Machine Learning for comparing and manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in machine learning, such as generative modeling and transfer learning. In this survey we explore contributions of Optimal Transport for Machine Learning over the period 2012–2023, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer, and reinforcement learning. We further highlight recent developments in computational Optimal Transport and its extensions, such as partial, unbalanced, Gromov, and Neural Optimal Transport, and their interplay with Machine Learning practice.

Paper Structure

This paper contains 30 sections, 107 equations, 15 figures, and 1 table.

Figures (15)

  • Figure 1: Illustration of (a) Monge formulation, (b) Kantorovich formulation and (c) Benamou-Brenier formulation. While (a) focuses on transportation maps $T$, (b) relies on transport plans $\gamma$ and (c) revolves around interpolations $\rho(t,x)$.
  • Figure 2: Comparison of how different metrics and divergences calculate discrepancies on the manifold of Gaussian distributions. The geometry induced by the Wasserstein distance is simpler and more intuitive than those given by other measures of discrepancy, which affects optimization procedures in machine learning.
  • Figure 3: An illustration of the sliced and max-sliced Wasserstein distances over 2-D distributions (a). In (b), we show the densities of $P$ and $Q$ after a projection by $\mathbf{u}$. In (c), we illustrate the computation of the 1-D Wasserstein distance for $p=1$, as the horizontal difference between the cumulative distributions of $P$ and $Q$. In (d), we show the distribution of the Wasserstein distance over $\mathbf{u} \sim \mathbb{S}^{1}$, alongside the mean (purple) and max (red) values. In (e), we show the Wasserstein distances over $\mathbf{u} \in \mathbb{S}^{1}$. Finally, (f) shows the estimation of $\text{SW}_{2}$ and $\text{max-SW}_{2}$ as a function of the number of projections $L$. Shaded regions show a 95% confidence interval around the average value.
  • Figure 4: Input convex neural network (ICNN) architecture proposed by Amos et al. (2017), which implements a function $f(\mathbf{x};\theta)$ that is convex with respect to its inputs $\mathbf{x}$.
  • Figure 5: Mini-batch OT between distributions $P$ (in blue) and $Q$ (in orange). An OT plan is calculated with mini-batches of 2 (a), 10 (b) and 100 (c) samples; (c) corresponds to the original problem. Overall, the mini-batch plans become less sparse, because all mass must be transported between each pair of mini-batches.
  • ...and 10 more figures
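
The projection-based estimate described in the caption of Figure 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `wasserstein_1d` and `sliced_wasserstein` are hypothetical, both samples are assumed to have the same size with uniform weights, and the max-sliced value is approximated by the largest distance among the sampled directions, which only lower-bounds the true supremum over the sphere.

```python
import numpy as np


def wasserstein_1d(x, y, p=2):
    """W_p between two 1-D empirical distributions.

    Assumes equal sample counts and uniform weights, in which case the
    optimal plan simply matches sorted samples.
    """
    x_sorted = np.sort(x)
    y_sorted = np.sort(y)
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p)


def sliced_wasserstein(X, Y, n_projections=100, p=2, seed=0):
    """Monte Carlo estimates of SW_p and (a lower bound on) max-SW_p.

    X, Y: arrays of shape (n, d) holding samples from P and Q.
    n_projections: number of random directions, L in the caption.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]

    # Draw directions uniformly on the unit sphere by normalizing Gaussians.
    U = rng.normal(size=(n_projections, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)

    # 1-D Wasserstein distance of the projected samples, one per direction.
    per_direction = np.array([wasserstein_1d(X @ u, Y @ u, p) for u in U])

    sw = np.mean(per_direction ** p) ** (1.0 / p)  # average over directions
    max_sw = per_direction.max()                   # best sampled direction
    return sw, max_sw


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 2))
    Y = X + np.array([3.0, 0.0])  # Q is P translated by (3, 0)
    sw, max_sw = sliced_wasserstein(X, Y, n_projections=100)
    print(f"SW_2 estimate: {sw:.3f}, max-SW_2 estimate: {max_sw:.3f}")
```

For a pure translation by a vector $\mathbf{t}$, each projected distance equals $|\mathbf{u}^{\top}\mathbf{t}|$, so the sampled max-sliced value cannot exceed $\|\mathbf{t}\|$ and the sliced value is smaller still, which makes this a convenient sanity check.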