Alternating Stochastic Variance-Reduced Algorithms with Optimal Complexity for Bilevel Optimization

Haimei Huo, Zhixun Su

TL;DR

The paper tackles nonconvex-strongly-convex bilevel optimization by deriving an explicit hypergradient and proposing two alternating variance-reduced algorithms, ALS-SPIDER and ALS-STORM. ALS-SPIDER combines SPIDER-based variance reduction with multi-step lower-level updates and an auxiliary variable that estimates the hypergradient, achieving the optimal sample complexity of $O(\epsilon^{-1.5})$ under standard assumptions. ALS-STORM replaces SPIDER with STORM to reduce per-iteration batch sizes: it preserves the same $O(\epsilon^{-1.5})$ rate and requires a large batch only in the initial iteration. The theoretical results are complemented by experiments on a synthetic problem and a data hyper-cleaning task, which show practical efficiency improvements and support the viability of the alternating framework for bilevel problems.
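
For orientation, the problem class and the hypergradient are usually written in the following standard form. This is a sketch of the common formulation, not a quotation from the paper: the function names $f$ (upper level) and $g$ (lower level, strongly convex in $y$) are assumed notation, while $\Phi$, $y^*(x)$, and $v^*(x)$ match the symbols used in Lemma 1.

$$\min_{x} \ \Phi(x) := f\bigl(x, y^*(x)\bigr), \qquad y^*(x) = \arg\min_{y} \ g(x, y)$$

$$\nabla \Phi(x) = \nabla_x f\bigl(x, y^*(x)\bigr) - \nabla^2_{xy} g\bigl(x, y^*(x)\bigr)\, v^*(x), \qquad v^*(x) = \bigl[\nabla^2_{yy} g\bigl(x, y^*(x)\bigr)\bigr]^{-1} \nabla_y f\bigl(x, y^*(x)\bigr)$$

Under this form, $v^*(x)$ is the solution of a linear system in the lower-level Hessian, which is presumably the quantity the auxiliary variable in ALS-SPIDER and ALS-STORM is meant to track.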

Abstract

This paper studies the unconstrained nonconvex-strongly-convex bilevel optimization problem. A common approach to solving this problem is to alternately update the upper-level and lower-level variables using (biased) stochastic gradients or their variants, with the lower-level variable updated for either one step or multiple steps. In this context, we propose two alternating stochastic variance-reduced algorithms, namely ALS-SPIDER and ALS-STORM, which introduce an auxiliary variable to estimate the hypergradient for updating the upper-level variable. ALS-SPIDER employs the SPIDER estimator for updating variables, while ALS-STORM is a modification of ALS-SPIDER designed to avoid using large batch sizes in every iteration. Theoretically, both algorithms can find an $\epsilon$-stationary point of the bilevel problem with a sample complexity of $O(\epsilon^{-1.5})$ for an arbitrary constant number of lower-level variable updates. To the best of our knowledge, they are the first algorithms to achieve the optimal complexity of $O(\epsilon^{-1.5})$ when performing multiple updates on the lower-level variable. Numerical experiments are conducted to illustrate the efficiency of our algorithms.
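
To make the difference between the two estimators concrete, here is a minimal single-level sketch of the SPIDER and STORM gradient recursions. It does not reproduce ALS-SPIDER or ALS-STORM: there is no lower-level problem or auxiliary variable, and the toy objective, batch sizes, refresh period `q`, momentum weight `a`, and step size are all assumptions chosen for illustration.

```python
import random

# Hypothetical toy objective: f(x; (a, b)) = 0.5 * a * x**2 - b * x,
# with E[a] = 1 and E[b] = 1, so the expected objective is minimized at x = 1.
def stoch_grad(x, sample):
    a, b = sample
    return a * x - b

def draw(rng, n):
    # Draw n i.i.d. samples (a, b).
    return [(rng.uniform(0.5, 1.5), rng.gauss(1.0, 0.5)) for _ in range(n)]

def mean_grad(x, batch):
    return sum(stoch_grad(x, s) for s in batch) / len(batch)

def spider_step(x, x_prev, d_prev, k, rng, q=10, big=256, small=8):
    # SPIDER: refresh with a large batch every q iterations; otherwise
    # correct the previous estimate with a small-batch gradient difference.
    if k % q == 0:
        return mean_grad(x, draw(rng, big))
    batch = draw(rng, small)
    return d_prev + mean_grad(x, batch) - mean_grad(x_prev, batch)

def storm_step(x, x_prev, d_prev, k, rng, a=0.1, big=256):
    # STORM: a large batch only at k = 0; afterwards one sample per step,
    # evaluated at both the current and the previous iterate.
    if k == 0:
        return mean_grad(x, draw(rng, big))
    batch = draw(rng, 1)
    return mean_grad(x, batch) + (1.0 - a) * (d_prev - mean_grad(x_prev, batch))

def run(step, iters=300, lr=0.1, seed=0):
    rng = random.Random(seed)
    x, x_prev, d = 5.0, 5.0, 0.0
    for k in range(iters):
        d = step(x, x_prev, d, k, rng)   # update the gradient estimate
        x_prev, x = x, x - lr * d        # plain gradient step with it
    return x

if __name__ == "__main__":
    print("SPIDER final x:", run(spider_step))
    print("STORM  final x:", run(storm_step))
```

The sketch mirrors the trade-off described in the abstract: the SPIDER recursion needs a periodic large batch, whereas the STORM recursion uses a large batch only in the initial iteration and a single sample afterwards.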

Paper Structure

This paper contains 14 sections, 8 theorems, 98 equations, 2 figures, and 4 algorithms.

Key Result

Lemma 1

Suppose Assumptions 1 and 2 hold. Then, $\nabla \Phi(x)$ in (2), $y^*(x)$ in problem (1), and $v^*(x)$ in (4) are respectively $L_{\Phi}$-, $C_{y}$-, and $C_{v}$-Lipschitz continuous, where

Figures (2)

  • Figure 1: Comparison of VRBO and ALS-SPIDER (in the first two columns), and ALS-SPIDER and ALS-STORM (in the last two columns) for the synthetic bilevel problem. We show the convergence performance w.r.t. outer loop iteration $k$ (resp. running time) in the first (resp. second) row.
  • Figure 2: Comparison of VRBO and ALS-SPIDER (in the first two columns), and ALS-SPIDER and ALS-STORM (in the last two columns) for the data hyper-cleaning task on the FashionMNIST dataset. Results w.r.t. outer iteration $k$ (resp. running time) are in the first (resp. second) row.

Theorems & Definitions (20)

  • Lemma 1
  • Definition 1
  • Definition 2
  • Remark 1
  • Remark 2
  • Lemma 2
  • Proof 1
  • Lemma 3
  • Proof 2
  • Lemma 4
  • ...and 10 more