Accelerating optimization over the space of probability measures

Shi Chen, Qin Li, Oliver Tse, Stephen J. Wright

TL;DR

A Hamiltonian-flow approach analogous to momentum-based approaches in Euclidean space is introduced and it is demonstrated that, in the continuous-time setting, algorithms based on this approach can achieve convergence rates of arbitrarily high order.

Abstract

The acceleration of gradient-based optimization methods is a subject of significant practical and theoretical importance, particularly within machine learning applications. While much attention has been directed towards optimizing within Euclidean space, the need to optimize over spaces of probability measures in machine learning motivates exploration of accelerated gradient methods in this context too. To this end, we introduce a Hamiltonian-flow approach analogous to momentum-based approaches in Euclidean space. We demonstrate that, in the continuous-time setting, algorithms based on this approach can achieve convergence rates of arbitrarily high order. We complement our findings with numerical examples.
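To make the Euclidean analogy concrete, the following is a minimal sketch, not taken from the paper, that compares plain gradient flow $\dot{x} = -\nabla f(x)$ with the heavy-ball ODE $\ddot{x} + a\dot{x} + \nabla f(x) = 0$ on a toy problem. The quadratic objective, the damping coefficient `a`, the step size, and all function names are illustrative assumptions.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic f(x) = 0.5 * sum(LAM * x**2); not from the paper.
LAM = np.array([1.0, 0.01])

def f(x):
    return 0.5 * np.sum(LAM * np.asarray(x, dtype=float) ** 2)

def grad_f(x):
    return LAM * x

def gradient_flow(x0, T, dt):
    """Explicit Euler discretization of x' = -grad f(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(int(round(T / dt))):
        x = x - dt * grad_f(x)
    return x

def heavy_ball_flow(x0, T, dt, a):
    """Semi-implicit Euler for x' = v, v' = -a * v - grad f(x), starting from v = 0."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(int(round(T / dt))):
        v = v + dt * (-a * v - grad_f(x))  # update momentum first
        x = x + dt * v                     # then move with the new momentum
    return x

x0 = [1.0, 1.0]
gap_gf = f(gradient_flow(x0, T=50.0, dt=0.01))            # optimality gap, gradient flow
gap_hb = f(heavy_ball_flow(x0, T=50.0, dt=0.01, a=0.2))   # optimality gap, heavy-ball flow
print(f"gradient flow gap: {gap_gf:.2e}, heavy-ball gap: {gap_hb:.2e}")
```

On this particular toy problem the momentum variable lets the slow direction (curvature 0.01) decay faster than under gradient flow; the choice `a=0.2` is tuned to that curvature and is not a general recommendation.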

Paper Structure (24 sections, 10 theorems, 164 equations, 5 figures, 1 table)

Key Result

Proposition 5

The motion of $\delta_{(x(t),v(t))}$, viewed as a probability measure evolving to optimize $E$, agrees with that of $(x(t),v(t))$, viewed as a sample evolving to optimize $f$, provided that $E$ and $f$, and likewise $H_t$ and $h_t$, are related as follows: More precisely, we have the following.

Figures (5)

  • Figure 1: Optimality gap vs time for Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes) and exponentially convergent Hamiltonian flow (Exp), for the functionals $\mathcal{V}_1$ (left) and $\mathcal{V}_2$ (right). The functionals are evaluated at empirical measures; see \ref{eq:he1}.
  • Figure 2: Total number of steps vs optimality gap (Tol) for Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes) and exponentially convergent Hamiltonian flow (Exp), for the functionals $\mathcal{V}_1$ (left) and $\mathcal{V}_2$ (right). The functionals are evaluated at empirical measures; see \ref{eq:he1}. The total number of steps includes those accepted and rejected in the adaptive step size controller.
  • Figure 3: Optimality gap for minimization of regularized KL divergence with target $g_1$ (left) and $g_2$ (right) for four methods: Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes), and exponentially convergent Hamiltonian flow (Exp). The functionals are evaluated at empirical measures; see \ref{eq:he1}.
  • Figure 4: Total number of steps vs optimality gap (Tol) for minimization of regularized KL divergence with target $g_1$ (left) and $g_2$ (right) for four methods: Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes), and exponentially convergent Hamiltonian flow (Exp). The functionals are evaluated at empirical measures; see \ref{eq:he1}. The total number of steps includes those accepted and rejected in the adaptive step size controller.
  • Figure 5: Left: Mean square errors for neural network training with target $f(x) = \sin(\pi x)$ for four methods: Wasserstein gradient flow (WGF), Heavy-Ball flow (HB), Nesterov flow (Nes), and exponentially convergent Hamiltonian flow (Exp). Middle: The target $f(x) = \sin(\pi x)$ and its neural network approximations obtained by running the four methods for $T=14$. Right: The number of total steps as a function of the mean square error (Tol). The number of total steps includes those accepted and rejected in the adaptive step size controller.

Theorems & Definitions (17)

  • Example 1: Heavy-ball ODE \cite{Po:1964some}
  • Example 2: Variational acceleration \cite{WiWiJo:2016variational}
  • Definition 1
  • Definition 2
  • Remark 3
  • Definition 4: Hamiltonian flow over probability measures
  • Proposition 5
  • Proposition 6
  • Theorem 7
  • Proposition 8
  • ...and 7 more