
Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds

Daniel Dodd, Louis Sharrock, Christopher Nemeth

TL;DR

This work addresses the sensitivity of stochastic optimization on Riemannian manifolds to learning-rate choices by introducing learning-rate-free algorithms. The three main methods—RDoG, NRDoG, and RDoWG—leverage Distance over Gradients and Distance over Weighted Gradients to adapt step sizes without prior knowledge of the optimum distance, achieving high-probability convergence rates that are optimal up to logarithmic factors. The framework supports bounded-iterate guarantees and includes curvature-aware and curvature-omitting variants, with theoretical results complemented by experiments on Rayleigh quotient on the sphere, Grassmann PCA, and Poincaré ball embeddings showing robustness against initialization and hyperparameter tuning. Overall, these LR-free approaches offer robust, scalable alternatives for geodesically convex stochastic optimization with practical impact across manifold-structured learning tasks.
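To make the "distance over gradients" idea concrete, the following is a minimal sketch and not the authors' exact algorithm. It assumes a DoG-style rule in which the step size is the largest geodesic distance travelled from the initial point so far (initialised to a small estimate `eps`) divided by the square root of the running sum of squared Riemannian gradient norms. It runs on the unit sphere, where the curvature correction is taken to be 1, with a deterministic gradient and a small diagonal matrix; all function names and constants are hypothetical choices for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def sphere_dist(x, y):
    # Geodesic distance on the unit sphere; the clamp guards against rounding.
    return math.acos(max(-1.0, min(1.0, dot(x, y))))

def sphere_exp(x, v):
    # Exponential map on the unit sphere at x, applied to the tangent vector v.
    n = norm(v)
    if n == 0.0:
        return list(x)
    y = [math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v)]
    s = norm(y)
    return [yi / s for yi in y]  # renormalise to limit numerical drift

def riemannian_grad(diag, x):
    # Objective f(x) = -x^T A x with A = diag(...), so minimising f maximises
    # the Rayleigh quotient. Project the Euclidean gradient onto the tangent space.
    egrad = [-2.0 * d * xi for d, xi in zip(diag, x)]
    c = dot(egrad, x)
    return [g - c * xi for g, xi in zip(egrad, x)]

def dog_on_sphere(diag, x0, steps=500, eps=1e-3):
    # Assumed DoG-style rule: eta_t = max_dist_t / sqrt(sum of ||g_s||^2, s <= t).
    x, max_dist, grad_sq_sum = list(x0), eps, 0.0
    for _ in range(steps):
        g = riemannian_grad(diag, x)
        grad_sq_sum += dot(g, g)
        if grad_sq_sum == 0.0:
            break  # already at a critical point
        eta = max_dist / math.sqrt(grad_sq_sum)
        x = sphere_exp(x, [-eta * gi for gi in g])
        max_dist = max(max_dist, sphere_dist(x, x0))
    return x

def rayleigh(diag, x):
    return sum(d * xi * xi for d, xi in zip(diag, x))

diag = [3.0, 2.0, 1.0]
x0 = [1.0 / math.sqrt(3.0)] * 3
x_final = dog_on_sphere(diag, x0)
print(rayleigh(diag, x0), rayleigh(diag, x_final))
```

The only user-supplied quantity in this sketch is the initial distance estimate `eps`, which takes the place of a learning rate. The step size grows automatically while the iterate moves away from the starting point and shrinks as squared gradient norms accumulate.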

Abstract

In recent years, interest in gradient-based optimization over Riemannian manifolds has surged. However, a significant challenge lies in the reliance on hyperparameters, especially the learning rate, which requires meticulous tuning by practitioners to ensure convergence at a suitable rate. In this work, we introduce innovative learning-rate-free algorithms for stochastic optimization over Riemannian manifolds, eliminating the need for hand-tuning and providing a more robust and user-friendly approach. We establish high probability convergence guarantees that are optimal, up to logarithmic factors, compared to the best-known optimally tuned rate in the deterministic setting. Our approach is validated through numerical experiments, demonstrating competitive performance against learning-rate-dependent algorithms.

Paper Structure (47 sections, 59 theorems, 183 equations, 8 figures, 4 tables, 4 algorithms)


Key Result

Lemma 2.1

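(Zhang and Sra, 2016) Suppose $a, b, c$ are the side lengths of a geodesic triangle $\Delta$ in a Riemannian manifold with sectional curvature lower bounded by $\kappa>-\infty$ and $A$ is the angle between sides $b$ and $c$ (defined through the inverse exponential map and inner product in tangent space). Then $a^2 \leq \zeta_\kappa(c)\, b^2 + c^2 - 2bc\cos(A)$, where $\zeta_\kappa \colon \mathbb{R}_{+} \to \mathbb{R}$ is the geometric curvature function, $\zeta_\kappa(c) = \frac{\sqrt{|\kappa|}\, c}{\tanh(\sqrt{|\kappa|}\, c)}$ for $\kappa < 0$ and $\zeta_\kappa(c) = 1$ for $\kappa \geq 0$.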
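A quick numerical illustration of this kind of trigonometric distance bound, offered as a sketch under stated assumptions rather than as part of the paper. It assumes the bound takes the form $a^2 \leq \zeta_\kappa(c)\, b^2 + c^2 - 2bc\cos(A)$ with $\zeta_\kappa(c) = \sqrt{|\kappa|}\,c / \tanh(\sqrt{|\kappa|}\,c)$ for $\kappa < 0$ and $\zeta_\kappa(c) = 1$ otherwise. The check uses a space of constant negative curvature, where the third side of a triangle follows from the hyperbolic law of cosines. The function names are hypothetical.

```python
import math

def zeta(kappa, c):
    # Assumed form of the geometric curvature function (equals 1 when kappa >= 0).
    if kappa >= 0 or c == 0:
        return 1.0
    s = math.sqrt(-kappa) * c
    return s / math.tanh(s)

def hyperbolic_third_side(kappa, b, c, angle):
    # Hyperbolic law of cosines in constant curvature kappa < 0:
    # cosh(k a) = cosh(k b) cosh(k c) - sinh(k b) sinh(k c) cos(A), k = sqrt(-kappa).
    k = math.sqrt(-kappa)
    cosh_a = (math.cosh(k * b) * math.cosh(k * c)
              - math.sinh(k * b) * math.sinh(k * c) * math.cos(angle))
    return math.acosh(max(1.0, cosh_a)) / k  # clamp guards against rounding

def trig_bound(kappa, b, c, angle):
    # Right-hand side of the assumed inequality.
    return zeta(kappa, c) * b ** 2 + c ** 2 - 2.0 * b * c * math.cos(angle)

kappa = -1.0
b, c, angle = 1.0, 1.0, math.pi / 2
a = hyperbolic_third_side(kappa, b, c, angle)
print(a ** 2, trig_bound(kappa, b, c, angle))
```

With $\zeta_\kappa \equiv 1$ the right-hand side reduces to the Euclidean law of cosines, so the factor $\zeta_\kappa(c) \geq 1$ can be read as the price paid for negative curvature.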

Figures (8)

  • Figure 1: Rayleigh quotient maximization on the unit sphere. Our algorithm, RDoG, converges without tuning, while RSGD shows sensitivity to the learning rate, leading to (a) overshooting or (b) slow convergence.
  • Figure 2: Results for Rayleigh quotient maximization on the sphere. (a) Geodesic distance between the final iterate and the numerical solution after $T=5000$ iterations as a function of the learning rate for RADAM and RSGD and as a function of the initial distance estimate for RDoG, RDoWG, and NRDoG. (b) Shows the regret (the function value of each iterate minus the function value of the numerical solution) for RSGD for a selection of learning rates. (c) Shows the regret for RDoG for a selection of different initial distance estimates. Results are averaged over ten replications.
  • Figure 3: Results for PCA on the Grassmann manifold. (a)-(c) Geodesic distance between the final iterate and the numerical solution after $T=2000$ iterations as a function of the learning rate for RADAM and RSGD and as a function of the initial distance estimate for RDoG, RDoWG, and NRDoG. Panels (b)-(c) use the final iterate of the weighted average sequence for RDoG, RDoWG, and NRDoG. Results are averaged over five replications.
  • Figure 4: Results for Poincaré word embeddings. (a) The mean average precision of the embeddings is assessed against the ground truth after 1000 training epochs. Results are averaged over five replications, with the embedding dimension set to five. (b)-(c) Two-dimensional embeddings after 2000 training epochs are visualized and annotated for the first 50 nouns of the mammals subtree for RDoG and RSGD.
  • Figure 5: Supplementary results for Rayleigh quotient maximization on the sphere. The plots depict regret as a function of the iteration, considering various learning rates. Results are averaged over ten random replications. The optimal RSGD learning rate is chosen by minimizing the regret after 5000 iterations. Note that (a) and (b) are equivalent to Figure 2 (b) and (c) respectively.
  • ...and 3 more figures

Theorems & Definitions (114)

  • Lemma 2.1
  • proof
  • Theorem 3.5
  • Theorem 3.6
  • Corollary 3.7
  • Remark 3.8
  • Remark 3.9
  • Remark 3.10
  • Theorem 3.11
  • Theorem 3.12
  • ...and 104 more