A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning
Minyoung Kim, Timothy M. Hospedales
TL;DR
The paper reframes differentiable bi-level optimization (BLO) in meta-learning as stochastic optimization by turning the inner objective into a posterior over inner parameters and the outer objective into an expectation under that posterior. It introduces HPO-SGLD, a practical SGLD-based hypergradient estimator that uses a forward-recursive scheme to avoid Hessian storage, achieving linear convergence with memory $O(\dim(\theta)+\dim(\lambda))$. The method robustly handles inner-optimization uncertainty, minibatch noise, and multiple inner minima. It demonstrates strong performance across hyperparameter optimization (HPO), loss-function learning, few-shot learning, INR meta-learning, and invariance learning, and it scales to large models (e.g., Vision Transformers with tens of millions of parameters). Empirically, the approach is more stable and scalable than traditional BLO methods based on the implicit function theorem (IFT), forward-mode differentiation (FMD), and reverse-mode differentiation (RMD). Overall, the work provides a unified, uncertainty-aware BLO framework that is both theoretically grounded and practically scalable for diverse meta-learning tasks.
Abstract
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning, including hyperparameter optimization, loss function learning, few-shot learning, invariance learning and more. These problems are often formalized as bi-level optimization (BLO) problems. We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution. To solve this stochastic optimization, we adopt Stochastic Gradient Langevin Dynamics (SGLD) MCMC to sample from the inner distribution, and propose a recurrent algorithm to compute the MC-estimated hypergradient. Our derivation is similar to forward-mode differentiation, but we introduce a new first-order approximation that makes it feasible for large models without needing to store huge Jacobian matrices. The main benefits are two-fold: i) Our stochastic formulation takes uncertainty into account, which makes the method robust to suboptimal inner optimization or to multiple, non-unique inner minima caused by overparametrization; ii) Compared to existing methods that often exhibit unstable behavior and hyperparameter sensitivity in practice, our method leads to considerably more reliable solutions. We demonstrate that the new approach achieves promising results on diverse meta learning problems and easily scales to learning 87M hyperparameters in the case of Vision Transformers.
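To make the stochastic formulation concrete, the sketch below samples an inner distribution with SGLD and forms a Monte Carlo estimate of the expected outer loss. It is an illustration under stated assumptions, not the paper's algorithm: the inner distribution is assumed to take the Gibbs form $p(\theta \mid \lambda) \propto \exp(-L_{\text{in}}(\theta, \lambda))$ at unit temperature, the 1-D ridge-regression problem and all function names are hypothetical, and the recurrent hypergradient estimator itself is not reproduced. The toy problem is chosen because its inner distribution is Gaussian, so the estimate can be checked against a closed form.

```python
import numpy as np

def inner_grad(theta, lam, x, y, n_total):
    """Minibatch gradient of U(theta) = 0.5*sum_i (x_i*theta - y_i)^2 + 0.5*lam*theta^2.
    The data term is rescaled by n_total/len(x) to be unbiased for the full sum."""
    scale = n_total / len(x)
    return scale * np.sum(x * (x * theta - y)) + lam * theta

def outer_loss(theta, x_val, y_val):
    """Validation loss, playing the role of the outer objective (toy choice)."""
    return 0.5 * np.mean((x_val * theta - y_val) ** 2)

def sgld_expected_outer_loss(lam, x_tr, y_tr, x_val, y_val, step=1e-3,
                             n_steps=20000, burn_in=2000, batch_size=16, seed=0):
    """Sample theta ~ exp(-U(theta)) with SGLD and average the outer loss."""
    rng = np.random.default_rng(seed)
    n = len(x_tr)
    theta = 0.0
    samples = []
    for t in range(n_steps):
        idx = rng.integers(0, n, size=batch_size)  # minibatch, with replacement
        g = inner_grad(theta, lam, x_tr[idx], y_tr[idx], n)
        # SGLD update: gradient step plus Gaussian noise of variance 2*step
        theta = theta - step * g + np.sqrt(2.0 * step) * rng.standard_normal()
        if t >= burn_in:
            samples.append(theta)
    samples = np.array(samples)
    losses = np.array([outer_loss(s, x_val, y_val) for s in samples])
    return losses.mean(), samples

# Hypothetical toy data: y = 2x + noise
data_rng = np.random.default_rng(0)
x_tr = data_rng.standard_normal(64)
y_tr = 2.0 * x_tr + 0.1 * data_rng.standard_normal(64)
x_val = data_rng.standard_normal(32)
y_val = 2.0 * x_val + 0.1 * data_rng.standard_normal(32)

lam = 1.0  # the single hyperparameter (ridge strength)
mc_estimate, samples = sgld_expected_outer_loss(lam, x_tr, y_tr, x_val, y_val)

# Closed form for comparison: the inner distribution is N(mu, 1/precision)
precision = np.sum(x_tr ** 2) + lam
mu = np.sum(x_tr * y_tr) / precision
analytic = outer_loss(mu, x_val, y_val) + 0.5 * np.mean(x_val ** 2) / precision

print(f"MC estimate: {mc_estimate:.4f}, closed form: {analytic:.4f}")
```

In this toy case the expected outer loss exceeds the outer loss evaluated at the inner minimizer by a term proportional to the posterior variance, which is how the stochastic objective accounts for uncertainty in the inner solution.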
