Faster Gradient Methods for Highly-Smooth Stochastic Bilevel Optimization

Lesi Chen, Junru Li, El Mahdi Chayti, Jingzhao Zhang

TL;DR

Faster rates are achievable for higher-order smooth problems, and the upper bound of F${}^2$SA-$p$ is nearly optimal in the highly smooth region.

Abstract

This paper studies the complexity of finding an $ε$-stationary point for stochastic bilevel optimization when the upper-level problem is nonconvex and the lower-level problem is strongly convex. Recent work proposed the first-order method F${}^2$SA, which achieves an $\tilde{\mathcal{O}}(ε^{-6})$ upper complexity bound for first-order smooth problems. This is slower than the optimal $Ω(ε^{-4})$ complexity lower bound of its single-level counterpart. In this work, we show that faster rates are achievable for higher-order smooth problems. We first reformulate F${}^2$SA as approximating the hyper-gradient with a forward difference. Based on this observation, we propose a class of methods, F${}^2$SA-$p$, that uses a $p$th-order finite difference for hyper-gradient approximation and improves the upper bound to $\tilde{\mathcal{O}}(p ε^{-4-2/p})$ for $p$th-order smooth problems. Finally, we demonstrate that the $Ω(ε^{-4})$ lower bound also holds for stochastic bilevel problems when the high-order smoothness holds for the lower-level variable, indicating that the upper bound of F${}^2$SA-$p$ is nearly optimal in the highly smooth region $p = Ω( \log ε^{-1} / \log \log ε^{-1})$.
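As a quick sanity check on these rates, the arithmetic below reads the exponent of the new bound as $-4-2/p$, the form that reproduces the $ε^{-6}$ rate of F${}^2$SA at $p=1$. It is a sketch that ignores constants and the logarithmic factors hidden in $\tilde{\mathcal{O}}$, not a restatement of the paper's proof.

$$
p = 1:\; ε^{-4-2} = ε^{-6}, \qquad p = 2:\; ε^{-4-1} = ε^{-5}, \qquad p = 4:\; ε^{-4-1/2} = ε^{-4.5}.
$$

$$
p \ge \frac{\log ε^{-1}}{\log\log ε^{-1}} \;\Longrightarrow\; ε^{-2/p} = \exp\!\left(\frac{2\log ε^{-1}}{p}\right) \le \exp\!\left(2\log\log ε^{-1}\right) = \left(\log ε^{-1}\right)^{2},
$$

so in that region $p\,ε^{-4-2/p} \le p\,ε^{-4}\left(\log ε^{-1}\right)^{2}$, which matches the $Ω(ε^{-4})$ lower bound up to the factor $p$ and a polylogarithmic term.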

Paper Structure

This paper contains 32 sections, 13 theorems, 76 equations, 2 figures, 1 table.

Key Result

Lemma 3.1

Assume the function $\psi: {\mathbb{R}}\rightarrow {\mathbb{R}}^d$ has a $C$-Lipschitz continuous $p$th-order derivative. There exist coefficients $\{ \alpha_j\}$ such that ... If $p$ is even, the indices run over $j = -p/2,\cdots, p/2$; if $p$ is odd, they run over $j =-(p-1)/2, \cdots, (p+1)/2$. Furthermore, all the coefficients satisfy $\vert j \alpha_j \vert \le 1$ for all $j \ne 0$ and $\vert \alpha_0 \vert$ ...
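To make the stencil in Lemma 3.1 concrete, here is a minimal NumPy sketch. It assumes the coefficients are the standard finite-difference weights, determined by requiring the rule to be exact for polynomials of degree at most $p$; the excerpt above does not show the paper's construction, so this is an illustration rather than the authors' definition. The names `stencil_indices`, `fd_coefficients`, `fd_derivative`, and the step size `nu` are hypothetical.

```python
import numpy as np


def stencil_indices(p):
    """Index set from Lemma 3.1: symmetric for even p, shifted right by one for odd p."""
    if p % 2 == 0:
        h = p // 2
        return np.arange(-h, h + 1)      # j = -p/2, ..., p/2
    h = (p - 1) // 2
    return np.arange(-h, h + 2)          # j = -(p-1)/2, ..., (p+1)/2


def fd_coefficients(p):
    """Assumed construction: solve for alpha so that
    sum_j alpha_j * j**k equals 1 when k == 1 and 0 otherwise, for k = 0..p."""
    idx = stencil_indices(p)
    # Row k of V holds j**k for every stencil index j.
    V = np.vander(idx.astype(float), N=p + 1, increasing=True).T
    rhs = np.zeros(p + 1)
    rhs[1] = 1.0
    return idx, np.linalg.solve(V, rhs)


def fd_derivative(psi, nu, p):
    """Approximate psi'(0) by (1/nu) * sum_j alpha_j * psi(j * nu)."""
    idx, alpha = fd_coefficients(p)
    return sum(a * psi(j * nu) for j, a in zip(idx, alpha)) / nu


def psi(t):
    """Toy vector-valued test function with psi'(0) = (1, 1)."""
    return np.array([np.sin(t), np.exp(t)])


if __name__ == "__main__":
    for p in (1, 2, 4):
        err = np.linalg.norm(fd_derivative(psi, 0.1, p) - np.ones(2))
        print(p, err)
```

Under this construction, $p=1$ gives the forward difference with weights $(-1, 1)$ on $j = 0, 1$, and $p=2$ gives the central difference with weights $(-1/2, 0, 1/2)$ on $j = -1, 0, 1$. On the toy function, the approximation error shrinks as $p$ grows for a fixed step size.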

Figures (2)

  • Figure 1: Performance of different algorithms on Example \ref{exmp:l2reg}.
  • Figure 2: Performance of different algorithms on Problem (\ref{eq:deep-l2reg}) with an MLP model.

Theorems & Definitions (29)

  • Definition 2.1
  • Definition 2.2: $p$th-order smooth bilevel problems
  • Example 2.1: Data hyper-cleaning
  • Example 2.2: Learn-to-regularize
  • Lemma 3.1
  • Remark 3.1: Effect of normalized gradient step
  • Lemma 3.2
  • Remark 3.2: Tighter bounds for $p=2$
  • Theorem 3.1: Main theorem
  • Remark 3.3: First-order smooth region
  • ...and 19 more