Warped geometric information on the optimisation of Euclidean functions

Marcelo Hartmann; Bernardo Williams; Hanlin Yu; Mark Girolami; Alessandro Barp; Arto Klami

Abstract

We consider the fundamental task of optimising a real-valued function defined in a potentially high-dimensional Euclidean space, such as the loss function in many machine-learning tasks or the logarithm of the probability distribution in statistical inference. We use notions from Riemannian geometry to recast the optimisation of a function on Euclidean space as optimisation on a Riemannian manifold with a warped metric, and then search for the function's optimum along this manifold. The warped metric chosen for the search domain induces a computationally friendly metric tensor, for which optimal search directions associated with geodesic curves on the manifold become easier to compute. Performing optimisation along geodesics is known to be generally infeasible, yet we show that on this specific manifold we can analytically derive Taylor approximations up to third order. In general these approximations to the geodesic curve will not lie on the manifold; however, we construct suitable retraction maps to pull them back onto the manifold. We can therefore optimise efficiently along the approximate geodesic curves. We cover the related theory, describe a practical optimisation algorithm and empirically evaluate it on a collection of challenging optimisation benchmarks. Our proposed algorithm, using a third-order approximation of geodesics, tends to outperform standard Euclidean gradient-based counterparts in terms of the number of iterations until convergence.
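To make the phrase "computationally friendly metric tensor" concrete, the following is a minimal sketch under an assumption that is not stated in the abstract: that the metric takes the rank-one form $G = I + \psi^2 \nabla\ell \nabla\ell^\top$ with a constant warp value $\psi$. The function names `warped_metric` and `riemannian_gradient` are hypothetical. Under this assumed form, the Sherman-Morrison identity gives the metric-preconditioned gradient in closed form, so no $D \times D$ system needs to be solved.

```python
import numpy as np


def warped_metric(grad, psi):
    """Assumed metric tensor G = I + psi^2 * g g^T (illustrative, not from the text)."""
    grad = np.asarray(grad, dtype=float)
    return np.eye(grad.size) + psi**2 * np.outer(grad, grad)


def riemannian_gradient(grad, psi):
    """Closed-form solution of G v = g for the assumed rank-one metric.

    Sherman-Morrison gives G^{-1} g = g / (1 + psi^2 * ||g||^2),
    which costs O(D) instead of the O(D^3) of a dense solve.
    """
    grad = np.asarray(grad, dtype=float)
    return grad / (1.0 + psi**2 * (grad @ grad))


if __name__ == "__main__":
    g = np.array([3.0, -4.0])  # an arbitrary Euclidean gradient
    for psi in (0.0, 0.1, 1.0):
        v_closed = riemannian_gradient(g, psi)
        v_dense = np.linalg.solve(warped_metric(g, psi), g)
        # The closed form and the dense solve should agree.
        print(psi, v_closed, np.allclose(v_closed, v_dense))
```

With $\psi = 0$ the sketch returns the Euclidean gradient unchanged, and larger $\psi$ shrinks the step where the gradient is large. Whether the paper's metric has exactly this form should be checked against its problem formulation section.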

Paper Structure (22 sections, 50 equations, 7 figures, 1 algorithm)

Introduction
Preliminaries and notation
Problem formulation and method overview
Riemannian conjugate gradient (RCG) with backward retraction
Third-order geodesic approximation
Retraction choice
Vector transport as inverse backward retraction
The novel RCG algorithm
Experiments
The D-dimensional squiggle probability model
The generalized Rosenbrock function
A test-set of the CUTE library
Discussion and concluding remarks
Simplification of dot-products on tangent spaces
Derivation of the Riemannian gradient as the Natural gradient
...and 7 more sections

Figures (7)

Figure 1: Visual interpretation of the domains of the functions $\ell$ and $f$. In the left panel, the plane region $(\theta_1, \theta_2) \in \Theta$ is to be understood as Euclidean. The coloured surface depicts the graph of $\ell$, denoted $\Gamma_\ell$, on which the function $f$ is defined; the ambient space is $\mathcal{N} \times \mathcal{M}_\psi$. In this example $\ell(\boldsymbol{\theta}) = \log \mathcal{G}([\theta_1, \theta_2 + \sin(\theta_1)] \mid \boldsymbol{\mu}, \Sigma)$, where $\mathcal{G}$ denotes the Gaussian density with $\boldsymbol{\mu} = \boldsymbol{0}$ and $\Sigma = \mathrm{diag}(1, 0.01)$. The set $\Gamma_\ell$ has elements $\boldsymbol{x} = (\boldsymbol{\theta}, \ell(\boldsymbol{\theta}))$ and is shown on the "height" axis. This set can be understood as an embedded Riemannian manifold in the higher-dimensional space $\mathcal{N} \times \mathcal{M}_\psi = \mathbb{R}^3$ (associated with the warped metric). The right panel shows how the domain of $f$ behaves as a function of a given warp function $\psi$: the closer $\psi$ is to zero, the closer to Euclidean the set (or geometry) $\Gamma_\ell$ is.
Figure 2: Number of iterations until convergence for a variety of dimensions using the squiggle model. RCG (ours), given in Algorithm 1, generally converges faster than CG-exact (ours) and CG-inexact. Both RCG (ours) and CG-exact (ours) are comparable to ND-inexact, which also converges fast (in number of iterations) in all cases when compared to CG-inexact. Here the dimension is taken up to $D = 250$ (see Appendix \ref{app:comp_cost} for extra information).
Figure 3: Number of iterations until convergence for the Rosenbrock function with varying dimension $D$. The parameters of the function were set to $a = 1$ and $b = 100$, as is usual in benchmark settings. RCG (ours) tends to converge faster than the CG counterparts and slower than ND-inexact. See also Appendix \ref{app:comp_cost} for other types of computational cost and discussion.
Figure 4: Convergence of RCG (ours), CG-exact (ours), CG-inexact and ND-exact on the test set of problems from the CUTE library. RCG (ours) generally needs fewer iterations to satisfy the stopping criteria than the CG counterparts. ND-inexact needs the fewest iterations, compared with RCG (ours) and the CG counterparts, for the second and third CUTE models in all dimensions. The computational cost in wall-clock time and memory is generally larger for the RCG (ours) implementation.
Figure 5: Geodesic approximations based on $3^{\mathrm{rd}}$-order Taylor series. The level sets in gray represent the function $\ell(\boldsymbol{\theta}) = \log \mathcal{N}([\theta_1, \theta_2 + \sin(1.3 \theta_1)] \mid \boldsymbol{\mu}, \Sigma)$, where $\mathcal{N}$ denotes the Gaussian density with $\boldsymbol{\mu} = \boldsymbol{0}$ and $\Sigma = \mathrm{diag}(20, 0.1)$. The blue point $\xi^{-1}(\boldsymbol{x}) = [3.0 \ 1.4]^\top$ and the blue vector $\boldsymbol{v} = [-1.2 \ -1.0]^\top$ mark the point and direction at which the geodesic curve is approximated for a series of increasing $\sigma^2$ values. As $\sigma^2$ increases, the approximations tend towards a straight line, and in the limit $\sigma^2 \rightarrow \infty$ the geodesic approximation becomes aligned with the search direction $\boldsymbol{v}$.
...and 2 more figures
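
The two objective functions named in the captions above can be written down as a short sketch. The squiggle log-density follows the two-dimensional formula in the Figure 1 caption. The chained sum used for the Rosenbrock function is one common generalisation and is an assumption here, since the captions give only the parameters $a = 1$ and $b = 100$. The function names `squiggle_logpdf` and `rosenbrock` are hypothetical.

```python
import numpy as np


def squiggle_logpdf(theta, var=(1.0, 0.01)):
    """log N([theta1, theta2 + sin(theta1)] | 0, diag(var)), as in the Figure 1 caption."""
    t1, t2 = theta
    z = np.array([t1, t2 + np.sin(t1)])  # the "squiggle" change of variables
    var = np.asarray(var, dtype=float)
    # Zero-mean diagonal Gaussian log-density evaluated at z.
    return -0.5 * np.sum(z**2 / var) - 0.5 * np.sum(np.log(2.0 * np.pi * var))


def rosenbrock(x, a=1.0, b=100.0):
    """Chained Rosenbrock sum over consecutive coordinate pairs (assumed form)."""
    x = np.asarray(x, dtype=float)
    return np.sum(b * (x[1:] - x[:-1] ** 2) ** 2 + (a - x[:-1]) ** 2)


if __name__ == "__main__":
    # The squiggle density peaks along the curve theta2 = -sin(theta1), at theta1 = 0.
    print(squiggle_logpdf((0.0, 0.0)))
    # The assumed Rosenbrock form is minimised at the all-ones vector.
    print(rosenbrock(np.ones(5)))
```

Both functions have narrow curved valleys (or ridges, for the log-density), which is the kind of geometry that makes them hard for plain Euclidean gradient methods.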
