LancBiO: dynamic Lanczos-aided bilevel optimization via Krylov subspace

Yan Yang; Bin Gao; Ya-xiang Yuan

TL;DR

This paper tackles a bottleneck in bilevel optimization: the Hessian inverse-vector product required for hyper-gradient evaluation. It introduces SubBiO, which confines the Hessian-inverse approximation to a low-dimensional Krylov subspace, and LancBiO, a dynamic Lanczos-based framework that uses restarts and residual corrections to approximate the Hessian inverse-vector product efficiently and stably across outer iterations. The authors establish a global convergence rate of $\mathcal{O}(\epsilon^{-1})$ and show that the cost is close to $1+\frac{1}{m}$ Hessian-vector products per outer iteration, where $m$ is the subspace dimension. Experiments on a synthetic problem and two deep learning tasks support the approach, with LancBiO delivering the most accurate hyper-gradient estimates and the smallest linear-system residuals among the compared methods.

Abstract

Bilevel optimization, with broad applications in machine learning, has an intricate hierarchical structure. Gradient-based methods have emerged as a common approach to large-scale bilevel problems. However, computing the hyper-gradient involves a Hessian inverse-vector product, which limits efficiency and is regarded as a bottleneck. To circumvent the inverse, we construct a sequence of low-dimensional approximate Krylov subspaces with the aid of the Lanczos process. The constructed subspace dynamically and incrementally approximates the Hessian inverse-vector product with less effort and thus leads to a favorable estimate of the hyper-gradient. Moreover, we propose a provable subspace-based framework for bilevel problems in which one central step is to solve a small tridiagonal linear system. To the best of our knowledge, this is the first time that subspace techniques have been incorporated into bilevel optimization. The resulting method not only enjoys an $\mathcal{O}(\epsilon^{-1})$ convergence rate but also demonstrates efficiency on a synthetic problem and two deep learning tasks.
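To make the role of the small tridiagonal system concrete, here is a minimal sketch of the classical (static) Lanczos process used to approximate $A^{-1}b$ within the Krylov subspace $\mathcal{K}_m(A,b)$. It is not the authors' dynamic LancBiO procedure: the restarts, residual corrections, and changing Hessians across outer iterations are omitted. The function name `lanczos_solve`, the breakdown tolerance, and the use of full reorthogonalization are illustrative assumptions, and $A$ is assumed to be symmetric positive definite.

```python
import numpy as np

def lanczos_solve(A, b, m, tol=1e-12):
    """Approximate A^{-1} b in the Krylov subspace K_m(A, b).

    Assumes A is symmetric positive definite. Hypothetical helper for
    illustration only; it is the textbook Lanczos process, not LancBiO.
    """
    n = b.shape[0]
    beta0 = np.linalg.norm(b)
    Q = np.zeros((n, m))          # orthonormal basis of the Krylov subspace
    alphas = np.zeros(m)          # diagonal of the tridiagonal matrix T
    betas = np.zeros(max(m - 1, 0))  # off-diagonal of T
    q_prev = np.zeros(n)
    q = b / beta0
    beta = 0.0
    k = m                         # number of basis vectors actually built
    for j in range(m):
        Q[:, j] = q
        w = A @ q - beta * q_prev     # one Hessian-vector product per step
        alphas[j] = q @ w
        w = w - alphas[j] * q
        # Full reorthogonalization: an illustrative choice for stability.
        w -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
        if j == m - 1:
            break
        beta = np.linalg.norm(w)
        if beta < tol:                # invariant subspace found; stop early
            k = j + 1
            break
        betas[j] = beta
        q_prev, q = q, w / beta
    # Small k-by-k tridiagonal system: T y = ||b|| e_1, then lift back by Q.
    T = (np.diag(alphas[:k])
         + np.diag(betas[: k - 1], 1)
         + np.diag(betas[: k - 1], -1))
    rhs = np.zeros(k)
    rhs[0] = beta0
    y = np.linalg.solve(T, rhs)
    return Q[:, :k] @ y
```

Each Lanczos step costs one matrix-vector product with $A$, and the only linear solve is with the small matrix `T`, so the full matrix is never inverted. A dense `np.linalg.solve` is used here for brevity; a dedicated tridiagonal solver would exploit the structure.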

Paper Structure (37 sections, 21 theorems, 167 equations, 17 figures, 2 tables, 4 algorithms)

Introduction
Contributions
Related Work
Subspace-based Algorithms
Why Krylov subspace: the SubBiO algorithm
Why dynamic Lanczos: the LancBiO framework
Relation to existing algorithms
Theoretical Analysis
Subspace Properties in Dynamic Lanczos Process
Convergence Analysis
Numerical Experiments
Related Work in Bilevel Optimization
Krylov Subspace and Lanczos Process
Dynamic Lanczos Subroutine
Extending LancBiO to Non-convex Lower-level Problem
...and 22 more sections

Key Result

Lemma 3.4

Under Assumptions assu:g and assu:strongg, $y^*(x)$ is $(L_{gx}/\mu_g)$-Lipschitz continuous; that is, for any $x_1,x_2\in\mathbb{R}^{d_x}$, $\left\| y^*(x_1)-y^*(x_2) \right\| \le \frac{L_{gx}}{\mu_g}\left\| x_1-x_2 \right\|$.
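The following is a sketch of a standard argument for a bound of this form, not the paper's own proof. The referenced assumptions are not reproduced on this page, so the sketch assumes what the constants' names suggest: $g(x,\cdot)$ is $\mu_g$-strongly convex for every $x$, and $\nabla_y g(\cdot,y)$ is $L_{gx}$-Lipschitz in $x$. Write $y_i = y^*(x_i)$, so that $\nabla_y g(x_i,y_i)=0$ by optimality.

```latex
\begin{align*}
\mu_g \left\| y_1 - y_2 \right\|^2
  &\le \left\langle \nabla_y g(x_1,y_1) - \nabla_y g(x_1,y_2),\; y_1 - y_2 \right\rangle
     && \text{(strong convexity in $y$)} \\
  &= \left\langle \nabla_y g(x_2,y_2) - \nabla_y g(x_1,y_2),\; y_1 - y_2 \right\rangle
     && \text{(both optimality gradients vanish)} \\
  &\le L_{gx} \left\| x_1 - x_2 \right\| \left\| y_1 - y_2 \right\|
     && \text{(Cauchy--Schwarz and Lipschitz continuity in $x$)}.
\end{align*}
```

Dividing both sides by $\mu_g \left\| y_1 - y_2 \right\|$ (the case $y_1 = y_2$ is trivial) gives the stated bound.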

Figures (17)

Figure 1: Left: test loss for stocBiO [ji2021stocbio] with different numbers of inner iterations $I$ used to approximate the Hessian inverse-vector product. Right: estimation error of the Hessian inverse-vector product in the data hyper-cleaning task with corruption rate $0.5$ for LancBiO and SubBiO (ours), AmIGO [arbel2022amortized], and SOBA [dagreou2022soba].
Figure 2: Illustration of approximating $A^{-1}b\in\mathcal{K}_{N}(A,b)$ by $v_n$ in the two-dimensional subspace $\mathcal{S}_n\subseteq\mathcal{K}_{n}(A,b)$.
Figure 3: An overview of LancBiO.
Figure 4: Comparison of the bilevel algorithms on data hyper-cleaning task when $p=0.8$. Left: test accuracy; Center: test loss; Right: residual norm of the linear system, $\left\| {A_kv_k-b_k} \right\|$.
Figure 5: Influence of the subspace dimension $m$ on LancBiO. The suffix of each legend entry denotes the subspace dimension $m$ or the number of inner iterations $I$. Left: norm of the hyper-gradient; Right: residual norm of the linear system, $\left\| {A_kv_k-b_k} \right\|$.
...and 12 more figures

Theorems & Definitions (37)

Remark 2.1
Lemma 3.4
Lemma 3.5
Proposition 3.7
Lemma 3.8
Lemma 3.10
Theorem 3.11
Definition B.1
Remark B.2
Remark B.3
...and 27 more
