An Adaptive Algorithm for Bilevel Optimization on Riemannian Manifolds

Xu Shi, Rufeng Xiao, Rujun Jiang

TL;DR

We address Riemannian bilevel optimization (RBO), where choosing step sizes typically requires problem-specific curvature and Lipschitz constants. We propose AdaRHD, a fully adaptive hypergradient-descent method that sets step sizes from the inverse cumulative gradient norm, removing the need for prior knowledge of these parameters. Theoretical results show an $\mathcal{O}(1/\epsilon)$ iteration complexity to obtain an $\epsilon$-stationary point, with gradient and Hessian-vector complexities mirroring those of non-adaptive methods; the same rate holds when retraction mappings replace exponential mappings. Empirical results on simple and robust RBO problems show competitive performance and greater robustness, supporting AdaRHD as a practical, parameter-free solver for Riemannian bilevel problems. Future work includes single-loop adaptive schemes and stochastic extensions to close the remaining complexity gaps.

Abstract

Existing methods for solving Riemannian bilevel optimization (RBO) problems require prior knowledge of the problem's first- and second-order information and of the curvature parameter of the Riemannian manifold to determine step sizes, which poses practical limitations when these parameters are unknown or computationally infeasible to obtain. In this paper, we introduce the Adaptive Riemannian Hypergradient Descent (AdaRHD) algorithm for solving RBO problems. To our knowledge, AdaRHD is the first method to incorporate a fully adaptive step size strategy that eliminates the need for problem-specific parameters in RBO. We prove that AdaRHD achieves an $\mathcal{O}(1/\epsilon)$ iteration complexity for finding an $\epsilon$-stationary point, thus matching the complexity of existing non-adaptive methods. Furthermore, we demonstrate that substituting exponential mappings with retraction mappings maintains the same complexity bound. Experiments demonstrate that AdaRHD achieves comparable performance to existing non-adaptive approaches while exhibiting greater robustness.

Paper Structure

This paper contains 39 sections, 24 theorems, 138 equations, 6 figures, 3 tables, 4 algorithms.

Key Result

Proposition 2.1

For a function $f: {\mathcal{M}} \rightarrow {\mathbb R}$, if its Riemannian gradient ${\mathcal{G}} f$ is $L$-Lipschitz continuous, then for all $x,y \in {\mathcal{U}} \subseteq {\mathcal{M}}$, it holds that If $f$ is $\mu$-geodesic strongly convex, then for all $x,y \in {\mathcal{U}}$, it holds that
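
The displayed inequalities are not shown in the statement above. For orientation, the forms these two bounds usually take in the Riemannian optimization literature are sketched below. The notation is an assumption and may differ from the paper's: ${\rm Exp}_x^{-1}$ denotes the inverse exponential map, $d(\cdot,\cdot)$ the geodesic distance, and $\langle \cdot,\cdot \rangle_x$ the Riemannian metric at $x$.

$$f(y) \le f(x) + \langle {\mathcal{G}} f(x), {\rm Exp}_x^{-1}(y) \rangle_x + \frac{L}{2}\, d^2(x,y) \qquad \text{(Lipschitz gradient)}$$

$$f(y) \ge f(x) + \langle {\mathcal{G}} f(x), {\rm Exp}_x^{-1}(y) \rangle_x + \frac{\mu}{2}\, d^2(x,y) \qquad \text{(geodesic strong convexity)}$$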

Figures (6)

  • Figure 1: Performance of the methods for $n=100$.
  • Figure 2: Performance of the methods for $n=1000$.
  • Figure 3: Epoch vs. validation accuracy and ergodic performance $\min_{i\in[0,t]}\|\widehat{{\mathcal{G}}} F(x_i, y_i^{K_i}, v_i^{N_i})\|_{x_i}^2$ under different initial step sizes for each algorithm. In the figures for "AdaRHD-X", the labels indicate the values of $1/a_0 = 1/b_0 = 1/c_0$; in the figures for "RHGD-X", the labels represent the values of $\eta_x = \eta_y$.
  • Figure 4: Shallow hyper-representation for regression (Left two columns: $n=200$, Right two columns: $n=1000$).
  • Figure 5: Deep hyper-representation for classification (Left: sampling ratio 12.5%, Right: sampling ratio 25%).
  • ...and 1 more figure

Theorems & Definitions (43)

  • Proposition 2.1: boumal2023introduction, han2024framework, li2025riemannian
  • Proposition 3.1
  • Definition 3.1: $\epsilon$-stationary point
  • Lemma 3.1: Hypergradient approximation error bound
  • Proposition 3.2
  • Remark 3.1
  • Theorem 3.1
  • Corollary 3.1
  • Theorem 3.2
  • Definition B.1
  • ...and 33 more