A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning

Minyoung Kim, Timothy M. Hospedales

TL;DR

The paper reframes differentiable bi-level optimization (BLO) in meta-learning as a stochastic optimization: the inner objective becomes a distribution over the inner parameters $\theta$, and the outer objective becomes an expectation under that distribution. It introduces HPO-SGLD, an SGLD-based hypergradient estimator whose forward-recursive scheme avoids storing large Jacobian matrices, achieving linear convergence with $O(\dim(\theta)+\dim(\lambda))$ memory. The method is robust to suboptimal inner optimization, minibatch noise, and multiple inner minima. It is evaluated on hyperparameter optimization (HPO), loss-function learning, few-shot learning, INR meta-learning, and invariance learning, and scales to large models (e.g., learning 87M hyperparameters for Vision Transformers). Compared with traditional BLO methods based on the implicit function theorem (IFT), forward-mode differentiation (FMD), and reverse-mode differentiation (RMD), the approach is reported to be more stable and more scalable. Overall, the work provides a unified, uncertainty-aware BLO framework with both a convergence analysis and practical scalability across diverse meta-learning tasks.

Abstract

We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning, including hyperparameter optimization, loss function learning, few-shot learning, invariance learning and more. These problems are often formalized as Bi-Level Optimization (BLO) problems. We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution. To solve this stochastic optimization, we adopt Stochastic Gradient Langevin Dynamics (SGLD) MCMC to sample from the inner distribution, and propose a recurrent algorithm to compute the MC-estimated hypergradient. Our derivation is similar to forward-mode differentiation, but we introduce a new first-order approximation that makes it feasible for large models without needing to store huge Jacobian matrices. The main benefits are two-fold: i) Our stochastic formulation takes into account uncertainty, which makes the method robust to suboptimal inner optimization or non-unique multiple inner minima due to overparametrization; ii) Compared to existing methods that often exhibit unstable behavior and hyperparameter sensitivity in practice, our method leads to considerably more reliable solutions. We demonstrate that the new approach achieves promising results on diverse meta learning problems and easily scales to learning 87M hyperparameters in the case of Vision Transformers.
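
To make the sampling step in the abstract concrete, the following is a minimal sketch of SGLD drawing inner parameters and a Monte Carlo estimate of the expected outer loss. It is not the paper's HPO-SGLD algorithm: the hypergradient recursion is not reproduced here. The toy ridge problem, the Gibbs form $p(\theta|\lambda) \propto \exp(-\mathcal{L}_T(\lambda,\theta)/T)$, and all names (`inner_grad`, `outer_loss`, `sgld_inner_samples`, `temperature`) are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): ridge-regularized mean estimation.
rng = np.random.default_rng(0)
y_train = rng.normal(1.0, 1.0, size=200)  # training targets
y_val = rng.normal(1.0, 1.0, size=200)    # validation targets


def inner_grad(theta, lam, batch):
    """Minibatch gradient of L_T = mean(0.5*(theta - y)^2) + 0.5*lam*theta^2."""
    return (theta - batch.mean()) + lam * theta


def outer_loss(theta):
    """Validation loss L_V = mean(0.5*(theta - y_val)^2)."""
    return 0.5 * np.mean((theta - y_val) ** 2)


def sgld_inner_samples(lam, step=0.01, temperature=0.05,
                       n_steps=25000, burn_in=5000, batch_size=20):
    """Draw approximate samples from p(theta | lam) ~ exp(-L_T / temperature)."""
    theta = 0.0
    samples = []
    for t in range(n_steps):
        batch = y_train[rng.integers(0, y_train.size, size=batch_size)]
        noise = rng.standard_normal()
        # SGLD update: noisy gradient step plus injected Gaussian noise.
        theta = (theta - step * inner_grad(theta, lam, batch)
                 + np.sqrt(2.0 * step * temperature) * noise)
        if t >= burn_in:  # discard burn-in iterations
            samples.append(theta)
    return np.array(samples)


lam = 0.5
temperature = 0.05
samples = sgld_inner_samples(lam, temperature=temperature)

# Monte Carlo estimate of the expected outer loss E_{p(theta|lam)}[L_V(theta)].
mc_outer = np.mean([outer_loss(th) for th in samples])

# Closed form for this Gaussian toy case, used only as a sanity check.
theta_star = y_train.mean() / (1.0 + lam)
post_var = temperature / (1.0 + lam)
closed_outer = 0.5 * ((theta_star - y_val.mean()) ** 2 + post_var + y_val.var())

print("MC estimate:", mc_outer, "closed form:", closed_outer)
```

Because the toy inner loss is quadratic, the target distribution is Gaussian and the Monte Carlo estimate can be checked against a closed form. For realistic models no such check exists, which is why the sampler's step size and burn-in length would need tuning.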

Paper Structure

This paper contains 40 sections, 5 theorems, 47 equations, 3 figures, 16 tables, 1 algorithm.

Key Result

Theorem 1

If assumptions (A1) and (A2) in the convergence section of the Supplement hold, then the first-order approximation used in the $g_m(\lambda)$ recursion becomes exact (i.e., approximation error $\simeq 0$).

Figures (3)

  • Figure 1: Illustrative toy problem. (a) Training and validation data. (b) Two $\theta^*(\lambda)$ solutions at $\lambda = 0.1$: (Left) good $\theta^*(\lambda)$ (val loss 0.0083), (Right) poor $\theta^*(\lambda)$ (val loss 0.3833). (c) $\mathcal{L}_V(\lambda,\theta^*(\lambda))$. Each row corresponds to one value of $\lambda$, and each column to one of the $\theta^*(\lambda)$ solutions. Blue (red) indicates low (high) loss value. (d) Row-wise average validation losses $\mathbb{E}_{p(\theta|\lambda)}[\mathcal{L}_V(\lambda,\theta)]$ employed in our proposed stochastic optimization (SO).
  • Figure 2: (Synthetic 1D) The number of inner iterations vs. the errors of the learned solutions, (Left) $|\lambda-\lambda^*|$ and (Right) $|\theta-\theta^*|$.
  • Figure 3: (Left) Relative errors between the true products of gradients and our first-order approximations. (Right) Cumulative errors between true hypergradients and our estimates. Red dotted lines indicate the end of the burn-in period.
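
The contrast that Figure 1 illustrates can be written out as follows. This is a sketch under assumptions: the symbol $\mathcal{L}_T$ for the inner (training) loss and the Gibbs form of $p(\theta|\lambda)$ do not appear in this summary and are introduced here only for illustration; $\mathcal{L}_V$, $\theta^*(\lambda)$, and the expectation are taken from the captions above.

$$
\begin{aligned}
\text{(BLO)}\quad & \min_{\lambda}\ \mathcal{L}_V\bigl(\lambda, \theta^*(\lambda)\bigr)
\quad \text{s.t.} \quad \theta^*(\lambda) \in \arg\min_{\theta}\ \mathcal{L}_T(\lambda, \theta), \\
\text{(SO)}\quad & \min_{\lambda}\ \mathbb{E}_{p(\theta|\lambda)}\bigl[\mathcal{L}_V(\lambda, \theta)\bigr],
\qquad p(\theta|\lambda) \propto \exp\bigl(-\mathcal{L}_T(\lambda, \theta)\bigr) \ \ \text{(assumed form)}.
\end{aligned}
$$

When the inner problem has several minimizers, as in Figure 1(b), the BLO objective depends on which $\theta^*(\lambda)$ the inner solver happens to return, whereas the SO objective averages the validation loss over them, as in Figure 1(d).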

Theorems & Definitions (7)

  • Theorem 1: Exactness of First-order Approximation
  • Theorem 2: Convergence of HPO-SGLD
  • Theorem 3: Exactness of First-order Approximation
  • proof
  • Theorem 4: Convergence of HPO-SGLD
  • proof
  • Theorem 5: alg_diff1