A general framework of Riemannian adaptive optimization methods with a convergence analysis

Hiroyuki Sakai; Hideaki Iiduka

TL;DR

The paper addresses stochastic optimization on Riemannian manifolds by introducing a general framework for adaptive methods on embedded submanifolds of $\mathbb{R}^d$. The framework unifies algorithms such as SGD, AdaGrad, RMSProp, Adam, and AMSGrad via tangent-space projections. It presents RAMSGrad as a direct extension of AMSGrad to embedded submanifolds and provides convergence analyses for both constant and diminishing step sizes, including the case of increasing mini-batch sizes. With $K$ the number of iterations and $b$ the mini-batch size, the rates scale as $\mathcal{O}\left(\frac{1}{K}+\frac{1}{b}\right)$ for constant steps and $\mathcal{O}\left(\left(1+\frac{1}{b}\right)\frac{\log K}{\sqrt{K}}\right)$ for diminishing steps, with improvements when the mini-batch size $b_k$ grows. The analysis hinges on projecting adaptive updates onto the tangent spaces via the projection $P_x$ and on retraction-Lipschitz smoothness to establish descent. Numerical experiments on PCA (on the Stiefel manifold) and low-rank matrix completion (LRMC, on the Grassmann manifold) show that RAMSGrad and RAdam perform competitively, supporting both the convergence theory and the methods' practical effectiveness on Riemannian optimization problems.
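The projection-then-retraction pattern described above can be illustrated on the simplest embedded submanifold, the unit sphere. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`project_tangent`, `retract`, `amsgrad_sphere`), the objective $f(x) = -x^\top A x$, the hyperparameter defaults, and the `eps` safeguard are all illustrative choices, and the update omits details (such as mini-batching) that the paper's algorithm includes.

```python
import numpy as np

def project_tangent(x, v):
    """Orthogonal projection of v onto the tangent space of the unit sphere at x."""
    return v - np.dot(x, v) * x

def retract(x, eta):
    """Retraction on the unit sphere: step in the ambient space, then renormalize."""
    y = x + eta
    return y / np.linalg.norm(y)

def amsgrad_sphere(A, x0, steps=500, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """AMSGrad-style minimization of f(x) = -x^T A x over the unit sphere.

    Illustrative sketch only; names and defaults are assumptions.
    """
    x = x0 / np.linalg.norm(x0)
    m = np.zeros_like(x)      # first-moment estimate
    v = np.zeros_like(x)      # second-moment estimate
    v_hat = np.zeros_like(x)  # running elementwise maximum of v
    for _ in range(steps):
        egrad = -2.0 * A @ x                 # Euclidean gradient of f
        g = project_tangent(x, egrad)        # Riemannian gradient on the sphere
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        # Scale elementwise in the ambient space, then project back onto the
        # tangent space at x before retracting.
        direction = project_tangent(x, m / (np.sqrt(v_hat) + eps))
        x = retract(x, -alpha * direction)
    return x

A = np.diag([3.0, 2.0, 1.0])
x = amsgrad_sphere(A, np.ones(3))
print("Rayleigh quotient:", x @ A @ x)
```

Because the elementwise scaling generally moves the direction out of the tangent space, the second call to `project_tangent` is what keeps the step a valid tangent vector before the retraction is applied.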

Abstract

This paper proposes a general framework of Riemannian adaptive optimization methods. The framework encapsulates several stochastic optimization algorithms on Riemannian manifolds and incorporates the mini-batch strategy that is often used in deep learning. Within this framework, we also propose AMSGrad on embedded submanifolds of Euclidean space. Moreover, we give convergence analyses valid for both a constant and a diminishing step size. Our analyses also reveal the relationship between the convergence rate and mini-batch size. In numerical experiments, we applied the proposed algorithm to principal component analysis and the low-rank matrix completion problem, which can be considered to be Riemannian optimization problems. Python implementations of the methods used in the numerical experiments are available at https://github.com/iiduka-researches/202408-adaptive.
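To make the mini-batch setting concrete, here is a hedged sketch of plain mini-batch Riemannian SGD for PCA on the Stiefel manifold. It is not taken from the linked repository. It assumes one common PCA formulation, $f(U) = \frac{1}{N}\sum_i \lVert x_i - UU^\top x_i\rVert^2$, a QR-based retraction, and synthetic Gaussian data; the helper names, step size, and batch size are hypothetical.

```python
import numpy as np

def sym(M):
    """Symmetric part of a square matrix."""
    return 0.5 * (M + M.T)

def stiefel_project(U, Z):
    """Project Z onto the tangent space of the Stiefel manifold at U."""
    return Z - U @ sym(U.T @ Z)

def qr_retract(U, eta):
    """QR-based retraction: orthonormalize U + eta, with a fixed sign convention."""
    Q, R = np.linalg.qr(U + eta)
    signs = np.where(np.diag(R) < 0, -1.0, 1.0)
    return Q * signs

def minibatch_rgrad(U, X, batch_idx):
    """Riemannian gradient of the mini-batch PCA loss at U (assumed formulation)."""
    Xb = X[batch_idx]
    egrad = -2.0 * Xb.T @ (Xb @ U) / len(batch_idx)  # Euclidean gradient
    return stiefel_project(U, egrad)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))                 # synthetic data, 200 samples in R^10
U, _ = np.linalg.qr(rng.standard_normal((10, 3)))  # random point on St(3, 10)

for _ in range(100):
    idx = rng.choice(len(X), size=32, replace=False)  # mini-batch of size b = 32
    U = qr_retract(U, -0.01 * minibatch_rgrad(U, X, idx))
```

Swapping the plain gradient step for an adaptive direction gives an adaptive variant; the projection and retraction steps stay the same.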

Paper Structure (21 sections, 12 theorems, 88 equations, 32 figures, 16 tables, 2 algorithms)

Introduction
Motivations
Contributions
Mathematical preliminaries
Examples
Riemannian stochastic optimization problem
Proposed general framework of Riemannian adaptive methods
Convergence analysis
Assumptions and useful lemmas
Convergence analysis of Algorithm \ref{alg:general}
Numerical experiments
Principal component analysis
Low-rank matrix completion
Conclusion
Useful Lemmas
...and 6 more sections

Key Result

Proposition 3.2

Suppose that Assumption asm:mainasm:Lipschitz holds. Then, for all $x\in M$ and $\eta\in T_xM$.

Figures (32)

Figure 1: Objective function value defined by \ref{eq:pca} versus number of iterations on the training set of the MNIST dataset.
Figure 2: Objective function value defined by \ref{eq:pca} versus number of iterations on the test set of the MNIST dataset.
Figure 3: Norm of the gradient of the objective function defined by \ref{eq:pca} versus number of iterations on the training set of the MNIST dataset.
Figure 4: Norm of the gradient of the objective function defined by \ref{eq:pca} versus number of iterations on the test set of the MNIST dataset.
Figure 5: Objective function value defined by \ref{eq:pca} versus number of iterations on the training set of the COIL100 dataset.
...and 27 more figures

Theorems & Definitions (25)

Definition 2.1: Retraction
Proposition 3.2
Lemma 3.3
proof
Theorem 3.4
proof
Theorem 3.5
proof
Theorem 3.6
proof
...and 15 more
