Mirror and Preconditioned Gradient Descent in Wasserstein Space

Clément Bonet; Théo Uscidda; Adam David; Pierre-Cyril Aubin-Frankowski; Anna Korba

TL;DR

The paper shows the advantages of adapting the geometry induced by the regularizer on ill-conditioned optimization tasks, and the improvement obtained by choosing different discrepancies and geometries in a computational biology task of aligning single cells.

Abstract

Since minimizing functionals on the Wasserstein space encompasses many applications in machine learning, several optimization algorithms on $\mathbb{R}^d$ have received counterparts on the Wasserstein space. We focus here on lifting two explicit algorithms: mirror descent and preconditioned gradient descent. These algorithms were introduced to better capture the geometry of the function to minimize, and are provably convergent under appropriate (namely relative) smoothness and convexity conditions. Adapting these notions to the Wasserstein space, we prove convergence guarantees for some Wasserstein-gradient-based discrete-time schemes, for new pairings of objective functionals and regularizers. The difficulty here is to carefully select the curves along which the functionals should be smooth and convex. We illustrate the advantages of adapting the geometry induced by the regularizer on ill-conditioned optimization tasks, and showcase the improvement obtained by choosing different discrepancies and geometries in a computational biology task of aligning single cells.
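To make the role of the geometry concrete, here is a minimal sketch of mirror descent on $\mathbb{R}^d$ (not the Wasserstein schemes of the paper) compared with plain gradient descent on an ill-conditioned quadratic. It uses the standard Euclidean update $\nabla\phi(x_{k+1}) = \nabla\phi(x_k) - \tau\nabla f(x_k)$. The objective `f`, the mirror map `phi`, the step sizes, and all function names are assumptions chosen for illustration; with this particular quadratic mirror map the update coincides with gradient descent preconditioned by a diagonal matrix.

```python
# Illustrative sketch only: Euclidean mirror descent vs. plain gradient descent
# on the ill-conditioned quadratic f(x) = 0.5 * sum_i d_i * x_i**2.
# Assumption: the mirror map is phi(x) = 0.5 * sum_i d_i * x_i**2, so that
# f is 1-smooth and 1-convex relative to phi.

def f(x, d):
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad_f(x, d):
    return [di * xi for di, xi in zip(d, x)]

def gradient_descent(x0, d, step, n_iters):
    x = list(x0)
    for _ in range(n_iters):
        g = grad_f(x, d)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def mirror_descent(x0, d, step, n_iters):
    # Update: grad_phi(x_next) = grad_phi(x) - step * grad_f(x),
    # where grad_phi(x) = d * x (elementwise) is inverted by dividing by d.
    x = list(x0)
    for _ in range(n_iters):
        g = grad_f(x, d)
        dual = [di * xi - step * gi for di, xi, gi in zip(d, x, g)]
        x = [yi / di for yi, di in zip(dual, d)]
    return x

d = [1.0, 100.0]   # condition number 100
x0 = [1.0, 1.0]

# Plain GD is limited by the largest curvature: step <= 1 / max(d).
x_gd = gradient_descent(x0, d, step=1.0 / max(d), n_iters=50)
# In the geometry of phi, a step of order 1 is admissible.
x_md = mirror_descent(x0, d, step=0.5, n_iters=50)

print("f after GD:", f(x_gd, d))
print("f after MD:", f(x_md, d))
```

In this toy setting, plain gradient descent contracts the flat coordinate by only a factor $0.99$ per step, whereas the mirror step contracts every coordinate by the same factor $1 - \tau$, independently of the conditioning.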

Paper Structure (86 sections, 33 theorems, 206 equations, 8 figures)

Introduction
Contributions.
Background
Bregman divergence on $L^2(\mu)$.
Differentiability on $(\mathcal{P}_2(\mathbb{R}^d ),\mathrm{W}_2)$.
Examples of functionals.
Convexity and smoothness in $(\mathcal{P}_2(\mathbb{R}^d ),\mathrm{W}_2)$.
Mirror descent and preconditioned gradient descent on $\mathbb{R}^d$.
Mirror descent
Iterates of mirror descent.
Implementation.
Preconditioned gradient descent
Convergence guarantees.
Applications and Experiments
Relative convexity of functionals.
...and 71 more sections

Key Result

Proposition 1

Let $\mathcal{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}\cup \{+\infty\}$ be a Wasserstein differentiable functional on $D(\mathcal{F})$. Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ and $\Tilde{\mathcal{F}}_\mu(\mathrm{T}) = \mathcal{F}(\mathrm{T}_\#\mu)$ for all $\mathrm{T}\in D(\Tilde{\mathcal{F}}_\mu)$ [...]

Figures (8)

Figure 1: (Left) Value of $\mathcal{W}$ along the flow for two different interaction Bregman potentials, (Middle and Right) Trajectories of particles to minimize $\mathcal{W}$.
Figure 2: Convergence towards Gaussians $\mathcal{N}(0,UDU^T)$ averaged over 20 covariances, with $U\sim\mathrm{Unif}(O_{10}(\mathbb{R}))$ and $D$ fixed.
Figure 3: Preconditioned GD vs. (vanilla) GD to predict the responses of cell populations to cancer treatment on 4i (Upper row) and scRNAseq (Lower row) datasets. For each treatment, starting from the untreated cells $\mu_i$, we minimize $\mathcal{F}(\mu)=D(\mu,\nu_i)$ with $\nu_i$ the treated cells. The plot is organized as pairs of columns, each corresponding to optimizing a specific metric, with two scatter plots displaying points $z_i = (x_i, y_i)$ where (First column) $y_i$ is the attained minimum $\mathcal{F}(\hat{\mu}) = D(\hat{\mu}, \nu_i)$ with preconditioning and $x_i$ the one attained without preconditioning, and (Second column) $y_i$ is the number of iterations to reach convergence with preconditioning and $x_i$ the number without preconditioning. A point below the diagonal $y=x$ thus corresponds to an experiment in which preconditioning provides (First column) a better minimum or (Second column) faster convergence. We assign a color to each treatment and plot three runs, obtained with three different initializations, along with their mean (brighter point).
Figure 4: (Left) Value of $\mathcal{W}$ along the flow for two different interaction Bregman potentials, (Right) Trajectories of particles to minimize $\mathcal{W}$.
Figure 5: Convergence towards Gaussian $\mathcal{N}(0,D)$ with $D$ diagonal and uniformly sampled on $[0,50]^{10}$.
...and 3 more figures

Theorems & Definitions (73)

Definition 1
Definition 2: Relative smoothness and convexity
Proposition 1
Definition 3
Proposition 2
Proposition 3
Proposition 4
Proposition 5
Proposition 6
Proposition 7
...and 63 more
