Bilevel Optimization with Lower-Level Uniform Convexity: Theory and Algorithm

Yuman Wu, Xiaochuan Gong, Jie Hao, Mingrui Liu

TL;DR

A tractable class of bilevel optimization problems that interpolates between lower-level strong convexity and general convexity via lower-level uniform convexity is identified, and a novel implicit differentiation theorem is established characterizing the hyperobjective's smoothness property.

Abstract

Bilevel optimization is a hierarchical framework where an upper-level optimization problem is constrained by a lower-level problem, commonly used in machine learning applications such as hyperparameter optimization. Existing bilevel optimization methods typically assume strong convexity or Polyak-Łojasiewicz (PL) conditions for the lower-level function to establish non-asymptotic convergence to a solution with small hypergradient. However, these assumptions may not hold in practice, and recent work~\citep{chen2024finding} has shown that bilevel optimization is inherently intractable for general convex lower-level functions when the goal is to find small hypergradients. In this paper, we identify a tractable class of bilevel optimization problems that interpolates between lower-level strong convexity and general convexity via \emph{lower-level uniform convexity}. For uniformly convex lower-level functions with exponent $p\geq 2$, we establish a novel implicit differentiation theorem characterizing the hyperobjective's smoothness property. Building on this, we design a new stochastic algorithm, termed UniBiO, with provable convergence guarantees, based on an oracle that provides stochastic gradient and Hessian-vector product information for the bilevel problems. Our algorithm achieves an $\widetilde{O}(\epsilon^{-5p+6})$ oracle complexity bound for finding $\epsilon$-stationary points. Notably, our complexity bounds match the optimal rates in terms of the $\epsilon$ dependency for strongly convex lower-level functions ($p=2$), up to logarithmic factors. Our theoretical findings are validated through experiments on synthetic tasks and data hyper-cleaning, demonstrating the effectiveness of our proposed algorithm.
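As a quick illustration of how the stated oracle complexity scales with the uniform-convexity exponent, the following minimal sketch evaluates the exponent $5p-6$ from the bound above. The function name `oracle_complexity_exponent` is hypothetical and only restates the arithmetic; it is not part of the paper's algorithm.

```python
def oracle_complexity_exponent(p: float) -> float:
    """Return k such that the stated bound reads O~(eps^{-k}), i.e. k = 5p - 6.

    Illustrative helper only: it restates the exponent from the abstract.
    """
    if p < 2:
        # The abstract states the bound for uniform-convexity exponents p >= 2.
        raise ValueError("the bound is stated for p >= 2")
    return 5 * p - 6


if __name__ == "__main__":
    for p in (2, 3, 4):
        # p = 2 is the strongly convex case highlighted in the abstract.
        print(f"p = {p}: O~(eps^-{oracle_complexity_exponent(p)})")
```

For $p=2$ the exponent evaluates to $4$, which is the strongly convex case the abstract singles out; larger $p$ gives a steeper dependence on $\epsilon$.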

Paper Structure (41 sections, 25 theorems, 118 equations, 5 figures, 1 table, 2 algorithms)

Key Result

Theorem 4.1

Suppose Assumptions ass:lowerlevelg and ass:upperlevelf hold. Then $\Phi$ is differentiable in $x$ and can be computed as follows: In addition, the function $\Phi$ satisfies the following properties: where $l_p=\left(\frac{pl_{g,1}}{\mu}\right)^{\frac{1}{p-1}}$, $L_{\phi_1}=l_p\left(l_{f,1}+ \frac{l_{f,2}l_{g,2}}{\mu}+\frac{l_{g,1}l_{f,1}}{\mu}+\frac{l_{g,1}l_{f,1}l_{g,2}}{\mu^2}\right)$, $L_{\phi_2}=\ldots$
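To make the two fully stated constants concrete, here is a minimal sketch that evaluates $l_p$ and $L_{\phi_1}$ exactly as written in the theorem statement above. The function names and the numeric values are illustrative assumptions; the source does not give values for $l_{g,1}$, $l_{g,2}$, $l_{f,1}$, $l_{f,2}$ or $\mu$, and $L_{\phi_2}$ is omitted because its expression is cut off.

```python
def holder_constant(p: float, l_g1: float, mu: float) -> float:
    """l_p = (p * l_{g,1} / mu)^(1 / (p - 1)), as written in the theorem."""
    if p < 2 or mu <= 0:
        raise ValueError("expects p >= 2 and mu > 0")
    return (p * l_g1 / mu) ** (1.0 / (p - 1))


def l_phi1(p: float, l_g1: float, l_g2: float,
           l_f1: float, l_f2: float, mu: float) -> float:
    """L_{phi_1} = l_p * (l_f1 + l_f2*l_g2/mu + l_g1*l_f1/mu + l_g1*l_f1*l_g2/mu^2)."""
    l_p = holder_constant(p, l_g1, mu)
    bracket = (l_f1
               + l_f2 * l_g2 / mu
               + l_g1 * l_f1 / mu
               + l_g1 * l_f1 * l_g2 / mu ** 2)
    return l_p * bracket


if __name__ == "__main__":
    # Hypothetical constants, chosen only to show how the values change with p.
    for p in (2, 4, 8):
        print(p, holder_constant(p, l_g1=1.0, mu=1.0),
              l_phi1(p, l_g1=1.0, l_g2=1.0, l_f1=1.0, l_f2=1.0, mu=1.0))
```

With all constants set to $1$ and $p=2$, this gives $l_p=2$ and $L_{\phi_1}=8$; as $p$ grows, the exponent $\frac{1}{p-1}$ shrinks and $l_p$ approaches $1$ for these particular values.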

Figures (5)

  • Figure 1: Convergence results for synthetic experiments on upper-level non-convex, lower-level uniformly convex bilevel optimization with uniform-convexity parameter $p \in \{2,4,6,8\}$, in the deterministic case and in the stochastic case with Gaussian noise $\mathcal{N}(0,0.01)$, $\mathcal{N}(0,1.0)$, and $\mathcal{N}(0,10)$, respectively.
  • Figure 2: Results of bilevel optimization on data hyper-cleaning with probability $\tilde{p}=0.1$ and the uniformly convex regularizer $\|w\|_p^p$ with $p=3$. Subfigures (a) and (b) show the training and test accuracy against the training epoch. Subfigures (c) and (d) show the training and test accuracy against the running time.
  • Figure 3: Results of bilevel optimization on synthetic example 2 for $p \in \{4,12,20\}$. All algorithms are initialized at $(x_0, y_0) = (0.001, 0.001)$, and the upper-level variable is updated for $T = 500$ iterations. The performance of the algorithms is evaluated using the ground-truth hypergradient $\nabla \Phi(x) = \sin(x)\cos(\sin(x))$. For all algorithms, learning rates are tuned by grid search over the range $[0.01, 1]$.
  • Figure 4: Results of bilevel optimization on data hyper-cleaning with noise $\tilde{p}=0.1$ and $p=4$. Subfigures (a) and (b) show the training and test accuracy against the training epoch. Subfigures (c) and (d) show the training and test accuracy against the running time.
  • Figure 5: Log-log plot of the convergence behavior of the averaged hypergradient norm under different uniform-convexity parameters $p$.

Theorems & Definitions (44)

  • Definition 3.1: Differentiability in Normed Vector Spaces
  • Theorem 4.1: Implicit Differentiation Theorem under LLUC
  • Lemma 4.2: Hölder Continuity of the Lower-Level Optimal Solution Mapping
  • Theorem 5.1
  • Lemma 5.2
  • Corollary 5.3
  • Lemma 5.4
  • Definition A.1
  • Lemma B.1: Restatement of Lemma 4.2
  • Proof: Proof of Lemma B.1
  • ...and 34 more