Convergence Properties of Stochastic Hypergradients

Riccardo Grazzi, Massimiliano Pontil, Saverio Salzo

TL;DR

This work addresses the challenge of efficiently computing hypergradients in bilevel problems where the lower-level constraint is a fixed-point mapping $w(\lambda)=\Phi(w(\lambda),\lambda)$. It introduces Stochastic Implicit Differentiation (SID), a fully stochastic variant of approximate implicit differentiation that uses two stochastic solvers to approximate the inner fixed-point and the associated linear system, and provides a solver-agnostic bound on the mean-square error of the hypergradient estimator. The analysis combines a non-asymptotic bound for stochastic fixed-point iterations with a bias-variance decomposition of the hypergradient estimator, showing convergence to the true gradient as inner accuracies improve. Empirical results on hyperparameter tuning tasks, including regularized logistic regression on MNIST and multinomial tasks on MNIST and Twenty Newsgroups, demonstrate that SID can outperform deterministic AID baselines in large-scale regimes. These findings enable scalable and accurate bilevel optimization for hyperparameter selection and meta-learning in data-rich settings.
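For reference, the quantity that SID approximates can be written out with the implicit function theorem. The symbols $E$ (upper-level objective), $f$ (its composition with $w(\lambda)$) and $v(\lambda)$ (solution of the linear system) are notation assumed here for illustration; the display is a sketch under the usual assumptions that $\Phi$ and $E$ are differentiable and $\Phi(\cdot,\lambda)$ is a contraction.

```latex
\begin{aligned}
f(\lambda) &= E(w(\lambda), \lambda), \qquad w(\lambda) = \Phi(w(\lambda), \lambda), \\
\nabla f(\lambda) &= \nabla_2 E(w(\lambda), \lambda)
  + \partial_2 \Phi(w(\lambda), \lambda)^\top v(\lambda), \\
\big(I - \partial_1 \Phi(w(\lambda), \lambda)^\top\big)\, v(\lambda)
  &= \nabla_1 E(w(\lambda), \lambda).
\end{aligned}
```

The two stochastic solvers mentioned above correspond to the two unknowns in this display: one approximates $w(\lambda)$, the other approximates $v(\lambda)$.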

Abstract

Bilevel optimization problems are receiving increasing attention in machine learning as they provide a natural framework for hyperparameter optimization and meta-learning. A key step to tackle these problems is the efficient computation of the gradient of the upper-level objective (hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. The method that we propose is a stochastic variant of the approximate implicit differentiation approach in (Pedregosa, 2016). We provide bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation. In particular, our main bound is agnostic to the choice of the two stochastic solvers employed by the procedure. We provide numerical experiments to support our theoretical analysis and to show the advantage of using stochastic hypergradients in practice.
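The procedure described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a ridge-regression lower level with a single regularization parameter `lam`, a validation mean-squared-error upper level, a gradient-step map `Phi(w, lam) = w - alpha * grad_w L(w, lam)` estimated on minibatches, and simple stochastic fixed-point iterations for both solvers. The function names, the step-size schedule and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np

def exact_hypergradient(X, y, Xv, yv, lam):
    """Closed-form derivative of the validation loss w.r.t. lam (ridge regression)."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    w = np.linalg.solve(A, X.T @ y / n)            # exact lower-level solution w(lam)
    grad_E = Xv.T @ (Xv @ w - yv) / len(yv)        # gradient of the validation loss at w(lam)
    return -grad_E @ np.linalg.solve(A, w)         # grad_E^T dw/dlam, with dw/dlam = -A^{-1} w

def sid_hypergradient(X, y, Xv, yv, lam, t, k, batch_size, rng,
                      step=lambda i: 2.0 / (i + 2.0)):
    """Stochastic implicit differentiation sketch with two minibatch solvers."""
    n, d = X.shape
    # alpha is chosen so that the full-batch map Phi(., lam) is a contraction.
    alpha = 1.0 / (np.linalg.eigvalsh(X.T @ X / n).max() + lam)

    def batch():
        idx = rng.choice(n, size=batch_size, replace=False)
        return X[idx], y[idx]

    # Solver 1: w <- w + eta_i * (Phi_hat(w, lam) - w), for t iterations.
    w = np.zeros(d)
    for i in range(t):
        Xb, yb = batch()
        grad = Xb.T @ (Xb @ w - yb) / batch_size + lam * w
        w = w - step(i) * alpha * grad

    grad_E = Xv.T @ (Xv @ w - yv) / len(yv)

    # Solver 2: v <- v + eta_i * (d1Phi_hat^T v + grad_E - v), for k iterations.
    v = np.zeros(d)
    for i in range(k):
        Xb, _ = batch()
        Av = alpha * (Xb.T @ (Xb @ v) / batch_size + lam * v)  # (I - d1Phi_hat^T) v
        v = v - step(i) * (Av - grad_E)

    # Here d2Phi(w, lam) = -alpha * w and E has no direct dependence on lam.
    return -alpha * (w @ v)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n, m, d = 200, 100, 5
w_true = rng.normal(size=d)
X, Xv = rng.normal(size=(n, d)), rng.normal(size=(m, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
yv = Xv @ w_true + 0.1 * rng.normal(size=m)
lam = 0.5

exact = exact_hypergradient(X, y, Xv, yv, lam)
approx = sid_hypergradient(X, y, Xv, yv, lam, t=2000, k=2000, batch_size=20, rng=rng)
print("exact:", exact, "stochastic estimate:", approx)
```

Setting `batch_size=n` and a constant step of `1.0` recovers a deterministic approximate implicit differentiation scheme on this toy problem, which is a convenient sanity check against the closed form.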

Paper Structure

This paper contains 17 sections, 19 theorems, 109 equations, 4 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Suppose that Assumptions ass:aid, ass:phiestimator, and ass:innerbackrates are satisfied. Let $\lambda \in \Lambda$, $t,k \in \mathbb{N}$ and set Then the following hold.

Figures (4)

  • Figure 1: Experiment with a single regularization parameter. Convergence of three variants of SID for 4 choices of the regularization hyperparameter $\lambda \in \mathbb{R}_{++}$. Here, 2 epochs refer, in the Batch version, to one iteration on the lower-level problem plus one iteration on the linear system, whereas, in the Stochastic versions, they refer to $100$ iterations on the lower-level problem plus $100$ iterations on the linear system. The plot shows mean (solid lines) and std (shaded regions) over 5 runs, which vary the train/validation splits and, for the stochastic methods, the order and composition of the minibatches.
  • Figure 2: Experiment with multiple regularization parameters. Convergence of three variants of SID for several choices of the regularization hyperparameter $\lambda \in \mathbb{R}^d_{++}$. The plot shows mean (solid lines) and std (shaded regions) over 10 runs. For each run, $\lambda_i = e^{\epsilon_i}$, where $\epsilon_i \sim \mathcal{U}[-2, 2]$ for every $i \in \{1,\dots, d\}$. Epochs are defined as in Figure 1.
  • Figure 3: Experiments with a single (first 4 images) and multiple (last image) regularization parameters. The plots show mean (solid lines) and std (shaded regions) over 5 (first 4 images) and 10 (last image) runs. Each run varies the train/validation splits and, for the stochastic methods, the order and composition of the minibatches. In addition, for each run in the last image, $\lambda_i = e^{\epsilon_i}$, where $\epsilon_i \sim \mathcal{U}[-2, 2]$ for every $i \in \{1,\dots, d\}$. All methods use the same total computational budget. The first five use the same number of epochs for the lower-level problem and for the associated linear system, whereas the last three methods, labeled $75\%/25\%$, dedicate $3/4$ of the epochs to the lower-level problem and only $1/4$ to the linear system.
  • Figure 4: Performance metrics for multinomial logistic regression on Twenty Newsgroups. All methods compute the hypergradient in $20$ epochs: methods labeled $75\%/25\%$ compute the lower-level solution in $15$ epochs and the solution of the linear system in $5$, while the others solve each problem in $10$ epochs. The plots show mean (solid line) and max-min (shaded region) over 5 runs, varying both the train/validation split and the mini-batch sampling of the stochastic algorithms. The starting point is the same for all methods and is set to $\lambda_0 = 0$, as in Grazzi et al. (2020).
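
The experimental setup in the captions above can be sketched in a few lines. This is only an illustration under stated assumptions: `sample_lambda` and `split_epochs` are hypothetical helper names, and the code reproduces just the sampling rule $\lambda_i = e^{\epsilon_i}$ with $\epsilon_i \sim \mathcal{U}[-2, 2]$ and the split of an epoch budget between the lower-level problem and the linear system.

```python
import numpy as np

def sample_lambda(d, rng, low=-2.0, high=2.0):
    """Draw lambda_i = exp(eps_i) with eps_i uniform on [low, high)."""
    eps = rng.uniform(low, high, size=d)
    return np.exp(eps)

def split_epochs(total, lower_fraction):
    """Split an epoch budget between the lower-level problem and the linear system."""
    lower = int(round(total * lower_fraction))
    return lower, total - lower

rng = np.random.default_rng(0)
lam = sample_lambda(10, rng)               # one positive regularization weight per coordinate
balanced = split_epochs(20, 0.5)           # equal budget for both solvers
skewed = split_epochs(20, 0.75)            # the 75%/25% variant from the captions
print(lam.min(), lam.max(), balanced, skewed)
```

Sampling on a log scale keeps every $\lambda_i$ strictly positive, which matches the constraint $\lambda \in \mathbb{R}^d_{++}$ stated in the captions.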

Theorems & Definitions (40)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4: MSE bound for SID
  • proof
  • Theorem 4.1: Constant step-size
  • Theorem 4.2: Decreasing step-sizes
  • Theorem 4.3
  • Corollary 4.1
  • Remark 4.1
  • ...and 30 more