Differentiable Generalized Sliced Wasserstein Plans

Laetitia Chapel, Romain Tavenard, Samuel Vaiter

TL;DR

This work tackles the computational bottleneck of optimal transport plan computation by introducing Differentiable Generalized Sliced Wasserstein Plans (DGSWP), which extend slicing-based OT with non-linear projections and a differentiable bilevel optimization framework. By leveraging a Stein-based smoothing of the outer objective, it provides a differentiable, GPU-efficient surrogate that yields meaningful transport plans even in high dimensions and on manifolds, with gradient information guiding the projection map. Empirically, DGSWP improves transport costs over prior sliced methods, enables robust gradient flows in Euclidean and hyperbolic spaces, and enhances image-generation workflows by replacing costly mini-batch OT in conditional flow matching. The approach offers practical impact for scalable OT in large-scale learning tasks, including manifold-valued data and generative modeling, while opening directions for ensuring projection injectivity and exploring injective neural architectures.

Abstract

Optimal Transport (OT) has attracted significant interest in the machine learning community, not only for its ability to define meaningful distances between probability distributions -- such as the Wasserstein distance -- but also for its formulation of OT plans. Its computational complexity remains a bottleneck, though, and slicing techniques have been developed to scale OT to large datasets. Recently, a novel slicing scheme, dubbed min-SWGG, lifts a single one-dimensional plan back to the original multidimensional space and selects the slice that yields the lowest Wasserstein distance as an approximation of the full OT plan. Despite its computational and theoretical advantages, min-SWGG inherits typical limitations of slicing methods: (i) the number of required slices grows exponentially with the data dimension, and (ii) it is constrained to linear projections. Here, we reformulate min-SWGG as a bilevel optimization problem and propose a differentiable approximation scheme to efficiently identify the optimal slice, even in high-dimensional settings. We further define a generalized extension that accommodates data living on manifolds. Finally, we demonstrate the practical value of our approach in various applications, including gradient flows on manifolds and high-dimensional spaces, as well as a novel sliced OT-based conditional flow matching for image generation -- where fast computation of transport plans is essential.
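
To make the slicing idea in the abstract concrete, here is a minimal NumPy sketch of a random-search baseline in the spirit of min-SWGG. It is not the authors' implementation. It assumes two point clouds of equal size with uniform weights and a squared Euclidean cost, and the function names (`lifted_plan_cost`, `min_swgg_random_search`) are hypothetical. Each direction induces a one-dimensional matching by sorting the projections; that matching is then evaluated in the original space, and the direction with the lowest cost is kept.

```python
import numpy as np

def lifted_plan_cost(X, Y, theta):
    """Cost, in the original space, of the plan induced by sorting along theta.

    Assumes X and Y have the same number of points and uniform weights.
    """
    sx = np.argsort(X @ theta)  # ordering of source points on the line
    sy = np.argsort(Y @ theta)  # ordering of target points on the line
    # The i-th smallest source projection is matched to the i-th smallest
    # target projection; the cost is measured with the full-dimensional points.
    cost = np.mean(np.sum((X[sx] - Y[sy]) ** 2, axis=1))
    return cost, sx, sy

def min_swgg_random_search(X, Y, n_slices=200, seed=0):
    """Keep the best of n_slices random unit directions (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best = (np.inf, None, None, None)
    for _ in range(n_slices):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)  # uniform direction on the unit sphere
        cost, sx, sy = lifted_plan_cost(X, Y, theta)
        if cost < best[0]:
            best = (cost, theta, sx, sy)
    return best  # (cost, direction, source ordering, target ordering)
```

Calling `min_swgg_random_search(X, Y)` returns the best cost together with the direction and the two orderings that define the plan. Because the search is purely random, the number of directions needed to find a good slice grows quickly with the dimension, which is the limitation the bilevel, gradient-based formulation is meant to address.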

Paper Structure

This paper contains 20 sections, 5 theorems, 36 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Let $\theta \in \mathbb{R}^q$ and assume $\phi^\theta$ is an injective map on $\mathcal{X}$. Then $d^\theta$ is a distance on $\mathcal{P}(\mathcal{X})$.

Figures (10)

  • Figure 1: 8Gaussians (source) to Two Moons (target) distributions and associated OT plans in grey: (Left) exact solution; (Middle) min-SWGG, which projects samples onto an optimal line determined by random sampling; (Right) Differentiable Generalized SW plan, which relies on a neural network to obtain a non-linear ordering of the samples. The colour gradient represents the ordering of the samples.
  • Figure 2: Example $g$ and $h$ (seen as functions of $\theta$ on the sphere $\mathbb{S}^1$) for a 2D OT problem. $\hat{h}_{\varepsilon, N}$ is a Monte-Carlo estimate of $h_\varepsilon$ with gradient $\hat{\nabla} h_{\varepsilon, N}$. Note that $g$ is continuous whereas $h$ is piecewise constant, hence the need for a smoothing mechanism, which results in $\hat{h}_{\varepsilon, N}$.
  • Figure 3: Impact of the variance reduction scheme (first 1,000 iterations).
  • Figure 4: Log of the Wasserstein Distance as a function of the number of iterations of the gradient flow, considering several target distributions. The source distribution is uniform in all cases.
  • Figure 5: Log of the Wasserstein distance (second and fourth panels) for two different wrapped normal target distributions (first and third panels), for HHSW, SWD and DGSWP.
  • ...and 5 more figures
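
The smoothing mechanism mentioned in the caption of Figure 2 can be illustrated with a small sketch. It assumes Gaussian perturbations, so that $h_\varepsilon(\theta) = \mathbb{E}_{z \sim \mathcal{N}(0, I)}[h(\theta + \varepsilon z)]$ and, by Stein's lemma, $\nabla h_\varepsilon(\theta) = \frac{1}{\varepsilon}\mathbb{E}[h(\theta + \varepsilon z)\, z]$. Subtracting the baseline $h(\theta)$ is one standard variance-reduction choice and is an assumption here; it may differ from the scheme evaluated in Figure 3. The function name `smoothed_value_and_grad` is hypothetical.

```python
import numpy as np

def smoothed_value_and_grad(h, theta, eps=0.1, n_samples=1000, seed=0):
    """Monte-Carlo estimate of the Gaussian-smoothed h and of its gradient.

    h     : callable mapping a 1-D parameter vector to a scalar
            (it may be piecewise constant)
    theta : 1-D NumPy array of parameters
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, theta.shape[0]))
    values = np.array([h(theta + eps * zi) for zi in z])
    h_hat = values.mean()  # estimate of the smoothed objective
    # Stein-type gradient estimator. Subtracting h(theta) leaves the
    # expectation unchanged (E[z] = 0) and usually lowers the variance.
    baseline = h(theta)
    grad_hat = ((values - baseline)[:, None] * z).mean(axis=0) / eps
    return h_hat, grad_hat
```

For a step function such as `h = lambda t: float(t[0] > 0)`, the exact gradient is zero almost everywhere, yet the smoothed estimate at `theta = np.array([0.0])` is non-zero and points toward the side where `h` is larger. This is what makes gradient-based optimization of a piecewise-constant outer objective possible.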

Theorems & Definitions (12)

  • Definition 1
  • Proposition 1
  • Definition 2
  • Lemma 1
  • Lemma 2: Stein's lemma
  • Proposition 2
  • proof: Proof of Lemma \ref{lem:scaling}
  • Lemma 3
  • proof
  • proof: Proof of Proposition \ref{prop:distance}
  • ...and 2 more