Riemannian Adaptive Optimization Methods

Gary Bécigneul, Octavian-Eugen Ganea

TL;DR

This work extends adaptive optimization methods (Adagrad, Adam, Amsgrad) to Riemannian settings by focusing on Cartesian products of manifolds, where each factor acts as a coordinate for adaptivity. It shows intrinsic obstacles to coordinate-wise adaptation on general manifolds and provides a principled Ramsgrad/RadamNc framework that operates across the product structure, with convergence guarantees under geodesic convexity and curvature bounds. Theoretical results quantify regret growth and emphasize the role of curvature through a zeta factor, while experiments on hyperbolic WordNet embeddings demonstrate faster convergence and lower training loss for the proposed methods, surpassing non-adaptive baselines. Overall, the paper delivers both rigorous analysis and practical algorithms for adaptive optimization in non-Euclidean spaces, enabling more efficient learning in non-Euclidean embedding tasks.

Abstract

Several first-order stochastic optimization methods commonly used in the Euclidean domain, such as stochastic gradient descent (SGD), accelerated gradient descent, or variance-reduced methods, have already been adapted to certain Riemannian settings. However, some of the most popular of these optimization tools, namely Adam, Adagrad and the more recent Amsgrad, remain to be generalized to Riemannian manifolds. We discuss the difficulty of generalizing such adaptive schemes to the most agnostic Riemannian setting, and then provide algorithms and convergence proofs for geodesically convex objectives in the particular case of a product of Riemannian manifolds, in which adaptivity is implemented across manifolds in the Cartesian product. Our generalization is tight in the sense that choosing the Euclidean space as the Riemannian manifold yields the same algorithms and regret bounds as those already known for the standard algorithms. Experimentally, we show faster convergence, and convergence to a lower train loss value, for Riemannian adaptive methods over their corresponding baselines on the realistic task of embedding the WordNet taxonomy in the Poincaré ball.
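
The idea of adapting "across manifolds in the Cartesian product" can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function and parameter names (`factorwise_amsgrad_step`, `exp_map`, `transport`, `sq_norm`, `eps`) are hypothetical, and the update rule is an Amsgrad-style step written from the general description above, with one scalar second-moment estimate per factor instead of per coordinate. The defaults assume Euclidean factors, where the exponential map is plain addition and parallel transport is the identity; a curved factor would need its own maps passed in.

```python
import numpy as np

# Euclidean stand-ins (assumptions): on R^n the exponential map is addition,
# parallel transport is the identity, and the metric is the dot product.
def euclid_exp(x, v):
    return x + v

def euclid_transport(x, y, v):
    return v

def euclid_sq_norm(x, g):
    return float(g @ g)

def factorwise_amsgrad_step(xs, grads, state, t, alpha=0.1, beta1=0.9,
                            beta2=0.999, exp_map=euclid_exp,
                            transport=euclid_transport,
                            sq_norm=euclid_sq_norm, eps=1e-12):
    """One Amsgrad-style step on a product of factors.

    Each factor i keeps a single scalar second-moment estimate built from
    the squared norm of its gradient, so adaptivity acts per factor.
    `eps` is a numerical safeguard added for this sketch.
    """
    alpha_t = alpha / np.sqrt(t)  # step-size schedule alpha / sqrt(t)
    new_xs = []
    for i, (x, g) in enumerate(zip(xs, grads)):
        # First moment, using the momentum transported from the last point.
        m = beta1 * state["tau"][i] + (1.0 - beta1) * g
        # Scalar second moment for the whole factor.
        v = beta2 * state["v"][i] + (1.0 - beta2) * sq_norm(x, g)
        v_hat = max(state["v_hat"][i], v)  # Amsgrad-style running maximum
        x_new = exp_map(x, -alpha_t * m / (np.sqrt(v_hat) + eps))
        # Move the momentum to the tangent space at the new point.
        state["tau"][i] = transport(x, x_new, m)
        state["v"][i] = v
        state["v_hat"][i] = v_hat
        new_xs.append(x_new)
    return new_xs

# Toy problem (assumption): a product R^2 x R^3 with a quadratic loss.
targets = [np.array([1.0, -2.0]), np.array([0.5, 0.5, 3.0])]
xs = [np.zeros(2), np.zeros(3)]
state = {"tau": [np.zeros(2), np.zeros(3)], "v": [0.0, 0.0], "v_hat": [0.0, 0.0]}

def loss(points):
    return sum(float((p - c) @ (p - c)) for p, c in zip(points, targets))

initial_loss = loss(xs)
for t in range(1, 201):
    grads = [2.0 * (p - c) for p, c in zip(xs, targets)]
    xs = factorwise_amsgrad_step(xs, grads, state, t)
final_loss = loss(xs)
```

When every factor is one-dimensional, the per-factor squared norm is just the squared coordinate of the gradient, so this sketch falls back to the usual coordinate-wise scheme, which is the sense in which a product structure gives each factor the role of a coordinate.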

Paper Structure

This paper contains 42 sections, 8 theorems, 48 equations, 3 figures.

Key Result

Theorem 1

Let $(x_t)$ and $(\hat{v}_t)$ be the sequences obtained from the Ramsgrad algorithm, $\alpha_t=\alpha/\sqrt{t}$, $\beta_1=\beta_{11}$, $\beta_{1t}\leq\beta_1$ for all $t\in [T]$ and $\gamma =\beta_1/\sqrt{\beta_2} <1$. We then have:

Figures (3)

  • Figure 1: Comparison of the Riemannian and Euclidean versions of Amsgrad.
  • Figure 2: Results for methods doing updates with the exponential map. From left to right we report: training loss, MAP on the train set, MAP on the validation set.
  • Figure 3: Results for methods doing updates with the retraction. From left to right we report: training loss, MAP on the train set, MAP on the validation set.

Theorems & Definitions (17)

  • Theorem 1: Convergence of Ramsgrad
  • proof
  • Theorem 2: Convergence of RadamNc
  • proof
  • proof
  • Lemma 3
  • proof
  • proof
  • Lemma 4
  • proof
  • ...and 7 more