A Lower Bound and a Near-Optimal Algorithm for Bilevel Empirical Risk Minimization

Mathieu Dagréou; Thomas Moreau; Samuel Vaiter; Pierre Ablin

TL;DR

The paper addresses bilevel empirical risk minimization, in which both the outer and inner objectives are finite sums, and proposes SRBA, a variance-reduced bilevel extension of SARAH that recursively estimates three update directions: one for the inner variable, one for the outer variable, and one for the variable used to approximate the hypergradient. It proves that SRBA needs $O((n+m)^{1/2}\varepsilon^{-1})$ oracle calls to reach an $\varepsilon$-stationary point, and it establishes a lower bound of $\Omega(m^{1/2}\varepsilon^{-1})$ through a worst-case construction. The two bounds match up to constant factors when $n$ and $m$ are of the same order, which makes SRBA near-optimal in that regime. The analysis relies on the recursive estimation of the three directions, a controlled hypergradient approximation $D_x(z,v,x)$, and descent lemmas combined through a Lyapunov function. In experiments on synthetic quadratic problems and on machine learning tasks (hyperparameter selection and datacleaning), SRBA converges quickly and reaches strong final performance compared with state-of-the-art stochastic bilevel solvers.
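To make the variance-reduction idea concrete, here is a minimal sketch of the single-level SARAH recursion that SRBA is described as extending. It is not SRBA itself: it minimizes an ordinary least-squares finite sum, and the function names, step size, and loop lengths are all illustrative assumptions rather than values from the paper.

```python
import numpy as np


def sarah_least_squares(A, b, step, n_outer, n_inner, seed=0):
    """Single-level SARAH on f(w) = 1/(2n) * sum_i (a_i @ w - b_i)**2.

    Illustrative only: this is plain SARAH, not the bilevel SRBA method.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(n_outer):
        # Full-gradient refresh at the start of each outer loop.
        v = A.T @ (A @ w - b) / n
        w_prev, w = w, w - step * v
        for _ in range(n_inner):
            i = rng.integers(n)
            a_i = A[i]
            # Recursive estimate: the same sample is evaluated at the
            # current and previous iterates, and the difference updates v.
            g_new = a_i * (a_i @ w - b[i])
            g_old = a_i * (a_i @ w_prev - b[i])
            v = g_new - g_old + v
            w_prev, w = w, w - step * v
    return w


def full_grad_norm(A, b, w):
    """Norm of the exact gradient, used only to monitor stationarity."""
    return np.linalg.norm(A.T @ (A @ w - b) / A.shape[0])


# Hypothetical synthetic data for the demonstration.
data_rng = np.random.default_rng(0)
A = data_rng.standard_normal((200, 5))
b = A @ np.ones(5) + 0.1 * data_rng.standard_normal(200)

w_hat = sarah_least_squares(A, b, step=0.02, n_outer=30, n_inner=200)
g_start = full_grad_norm(A, b, np.zeros(5))
g_end = full_grad_norm(A, b, w_hat)
print(g_start, g_end)
```

Each inner step costs two single-sample gradient evaluations, and the full gradient is recomputed only once per outer loop. According to the summary above, SRBA applies the same kind of recursion to three directions at once instead of one.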

Abstract

Bilevel optimization problems, which are problems where two optimization problems are nested, have more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk minimization problems and therefore have a sum structure. In this context, we propose a bilevel extension of the celebrated SARAH algorithm. We demonstrate that the algorithm requires $\mathcal{O}((n+m)^{\frac12}\varepsilon^{-1})$ oracle calls to achieve $\varepsilon$-stationarity with $n+m$ the total number of samples, which improves over all previous bilevel algorithms. Moreover, we provide a lower bound on the number of oracle calls required to get an approximate stationary point of the objective function of the bilevel problem. This lower bound is attained by our algorithm, making it optimal in terms of sample complexity.
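For reference, the nested finite-sum problem described in the abstract can be written as below. This is a sketch in assumed notation: the symbols $F_j$, $G_i$, and $z^*(x)$ are introduced here for illustration, and pairing $m$ with the outer sum and $n$ with the inner sum is an assumption, since the abstract only states that there are $n+m$ samples in total.

```latex
\min_{x} \; h(x) = F\bigl(z^*(x), x\bigr)
  = \frac{1}{m}\sum_{j=1}^{m} F_j\bigl(z^*(x), x\bigr),
\qquad
z^*(x) \in \operatorname*{arg\,min}_{z} \; G(z, x)
  = \frac{1}{n}\sum_{i=1}^{n} G_i(z, x).
```

An $\varepsilon$-stationary point is then a point $x$ at which the gradient of $h$ is small; the exact criterion (for example, whether the norm is squared or taken in expectation) is defined in the paper and is not restated here.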

Paper Structure (28 sections, 20 theorems, 167 equations, 2 figures, 1 table, 1 algorithm)

Introduction
SRBA: a Near-Optimal Algorithm for Bilevel Empirical Risk Minimization
Assumptions
Hypergradient Approximation
SRBA: Stochastic Recursive Bilevel Algorithm
Theoretical Analysis of SRBA
Mean Squared Error of the Estimated Directions
Fundamental Lemmas
Complexity Analysis of SRBA
Lower Bound for Bilevel ERM
Numerical Experiments
Conclusion
Convergence analysis of SRBA
Proof of \ref{prop:directions_cancels}
Smoothness constant of $h$
...and 13 more sections

Key Result

Proposition 2.3

Under Assumptions ass:regul_F and ass:regul_G, the function $h$ is $L^h$-smooth for some $L^h>0$, whose expression is given in app:sec:smoothness_h.
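Assuming $L^h$-smoothness is meant in the standard sense (the gradient of $h$ is $L^h$-Lipschitz), the property and the quadratic upper bound it implies can be spelled out as follows. The points $x$ and $x'$ are generic and introduced here only for illustration.

```latex
\|\nabla h(x) - \nabla h(x')\| \le L^h \,\|x - x'\|
\quad \text{for all } x, x',
\qquad\text{which implies}\qquad
h(x') \le h(x) + \langle \nabla h(x),\, x' - x \rangle + \frac{L^h}{2}\,\|x' - x\|^2 .
```

The second inequality is the form typically used in descent lemmas: it bounds the change in $h$ after a step in terms of the step direction and its squared length.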

Figures (2)

Figure 1: Comparison of the behavior of SRBA with other stochastic bilevel solvers. For each experiment, the solvers are run with 10 different seeds and the median performance over these seeds is reported. The shaded area corresponds to the performances between the 20% and the 80% percentiles. The performances are reported with respect to wall-clock time. Top: Experiments on quadratic functions. We report the gradient norm of the value function. Bottom: Hyperparameter selection with the IJCNN1 dataset.
Figure D.1: Comparison of stochastic bilevel solvers. Each solver is run on 10 random seeds and the lines show the median performances. The shaded area corresponds to the performances between the 20% and the 80% percentiles. Test error on the datacleaning task with the MNIST dataset with a corruption rate $0.5$.

Theorems & Definitions (41)

Proposition 2.3
Proposition 2.4
Proposition 2.5
Proposition 2.6
Definition 3.1
Proposition 3.2: MSE of the estimated directions
Lemma 3.3: Descent on the inner level
Lemma 3.4
Lemma 3.5
Theorem 1
...and 31 more
