Accelerating optimization over the space of probability measures

Shi Chen; Qin Li; Oliver Tse; Stephen J. Wright

TL;DR

The paper introduces a Hamiltonian-flow approach to optimization over probability measures, analogous to momentum-based methods in Euclidean space, and shows that, in the continuous-time setting, algorithms based on this approach can achieve convergence rates of arbitrarily high order.

Abstract

The acceleration of gradient-based optimization methods is a subject of significant practical and theoretical importance, particularly within machine learning applications. While much attention has been directed towards optimizing within Euclidean space, the need to optimize over spaces of probability measures in machine learning motivates exploration of accelerated gradient methods in this context too. To this end, we introduce a Hamiltonian-flow approach analogous to momentum-based approaches in Euclidean space. We demonstrate that, in the continuous-time setting, algorithms based on this approach can achieve convergence rates of arbitrarily high order. We complement our findings with numerical examples.
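To make the contrast between a plain gradient flow and a momentum-based (heavy-ball) flow concrete, here is a minimal particle sketch. It is not the authors' scheme: it assumes a potential-energy functional $E(\mu) = \int V \, d\mu$ with an illustrative ill-conditioned quadratic $V$, represents $\mu$ by an empirical measure of $N$ particles, and integrates both flows with a fixed-step Euler scheme. The names `LAM`, `grad_V`, `energy`, `gradient_flow`, and `heavy_ball_flow`, the damping value, and the step size are all hypothetical choices made for this illustration.

```python
import numpy as np

# Illustrative potential V(x) = 0.5 * sum_k LAM[k] * x_k^2 (an assumption,
# chosen to be ill-conditioned so that the effect of momentum is visible).
LAM = np.array([1.0, 0.01])


def grad_V(x):
    # For E(mu) = int V dmu, each particle is driven by grad V at its position.
    return LAM * x


def energy(x):
    # E evaluated at the empirical measure (1/N) * sum_i delta_{x_i}.
    return 0.5 * np.mean(np.sum(LAM * x**2, axis=1))


def gradient_flow(x0, dt, n_steps):
    # Explicit Euler for dx/dt = -grad V(x), applied to every particle.
    x = x0.copy()
    for _ in range(n_steps):
        x = x - dt * grad_V(x)
    return x


def heavy_ball_flow(x0, dt, n_steps, damping):
    # Euler-type scheme for dx/dt = v, dv/dt = -damping * v - grad V(x):
    # the velocity is updated first, then the position uses the new velocity.
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(n_steps):
        v = v + dt * (-damping * v - grad_V(x))
        x = x + dt * v
    return x


rng = np.random.default_rng(0)
x0 = rng.standard_normal((200, 2))  # 200 particles in R^2

dt, n_steps = 0.01, 5000            # final time T = 50
e0 = energy(x0)
e_gf = energy(gradient_flow(x0, dt, n_steps))
# Damping 2 * sqrt(smallest curvature): a common heuristic, assumed here.
e_hb = energy(heavy_ball_flow(x0, dt, n_steps, damping=2 * np.sqrt(LAM.min())))

print(f"initial energy:       {e0:.3e}")
print(f"gradient flow energy: {e_gf:.3e}")
print(f"heavy-ball energy:    {e_hb:.3e}")
```

Because the functional in this sketch is linear in $\mu$, the particles do not interact and the example reduces to many independent Euclidean problems. Functionals with interaction or entropy terms, such as the regularized KL divergence used in the paper's experiments, would require a different velocity field.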

Paper Structure (24 sections, 10 theorems, 164 equations, 5 figures, 1 table)

Introduction
Heavy-ball flow.
Variational acceleration flow.
Summary of Related Work
Organization of the paper
Background knowledge
Hamiltonian flows
Wasserstein metrics and induced convexity
Hamiltonian flows for optimizing in the space of probability measures
Heavy-Ball Flow
Variational Acceleration Flow
The Heavy-Ball Flow
Preliminary Results
Convex case
Strongly Convex Case
...and 9 more sections

Key Result

Proposition 5

The motion of $\delta_{(x(t),v(t))}$, viewed as a probability measure optimizing $E$, agrees with that of $(x(t),v(t))$, viewed as a sample optimizing $f$, provided that $E$ and $f$, and likewise $H_t$ and $h_t$, are suitably related.

Figures (5)

Figure 1: Optimality gap vs. time for Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes), and exponentially convergent Hamiltonian flow (Exp), for the functionals $\mathcal{V}_1$ (left) and $\mathcal{V}_2$ (right). The functionals are evaluated at empirical measures; see \ref{eq:he1}.
Figure 2: Total number of steps vs. optimality gap (Tol) for Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes), and exponentially convergent Hamiltonian flow (Exp), for the functionals $\mathcal{V}_1$ (left) and $\mathcal{V}_2$ (right). The functionals are evaluated at empirical measures; see \ref{eq:he1}. The total number of steps includes those accepted and rejected in the adaptive step size controller.
Figure 3: Optimality gap for minimization of regularized KL divergence with target $g_1$ (left) and $g_2$ (right) for four methods: Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes), and exponentially convergent Hamiltonian flow (Exp). The functionals are evaluated at empirical measures; see \ref{eq:he1}.
Figure 4: Total number of steps vs. optimality gap (Tol) for minimization of regularized KL divergence with target $g_1$ (left) and $g_2$ (right) for four methods: Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes), and exponentially convergent Hamiltonian flow (Exp). The functionals are evaluated at empirical measures; see \ref{eq:he1}. The total number of steps includes those accepted and rejected in the adaptive step size controller.
Figure 5: Left: Mean square errors for neural network training with target $f(x) = \sin(\pi x)$ for four methods: Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes), and exponentially convergent Hamiltonian flow (Exp). Middle: The target $f(x) = \sin(\pi x)$ and its neural network approximations obtained by running the four methods for $T=14$. Right: The number of total steps as a function of the mean square error (Tol). The number of total steps includes those accepted and rejected in the adaptive step size controller.

Theorems & Definitions (17)

Example 1: Heavy-ball ODE \cite{Po:1964some}
Example 2: Variational acceleration \cite{WiWiJo:2016variational}
Definition 1
Definition 2
Remark 3
Definition 4: Hamiltonian flow over probability measures
Proposition 5
Proposition 6
Theorem 7
Proposition 8
...and 7 more
