Scalable Nested Optimization for Deep Learning
Jonathan Lorraine
TL;DR
This thesis addresses scalable nested optimization in deep learning by recasting bilevel objectives in forms that remain tractable in high dimensions. It develops four complementary threads: hypernetworks for amortized hyperparameter optimization; hypergradients based on the implicit function theorem, which scale to millions of hyperparameters; complex momentum, which stabilizes optimization in differentiable games; and Generalized Ridge Rider, which finds diverse equilibria using Lyapunov exponents. Together, these methods make it possible to tune millions of parameters and hyperparameters, accelerate hyperparameter search, stabilize adversarial training, and discover multiple robust solutions in multi-agent settings. The results show practical gains in hyperparameter tuning, GAN training, and the diversity of equilibria found, with evidence across datasets and architectures. Formulations in terms of the losses $\mathcal{L}$, their Jacobians, and inverse-Hessian approximations underpin both the theoretical guarantees and the practical efficiency of the proposed methods.
Abstract
Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this setup: bilevel, or nested, optimization, in which different subsets of parameters are updated on different objectives nested inside one another. We focus on two motivating examples, hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when these nested problems are solved at large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups.
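
To make the nested structure concrete, consider hyperparameter optimization as a small illustrative sketch (not code from the thesis). The inner problem fits weights $w^*(\lambda) = \arg\min_w \mathcal{L}_T(w, \lambda)$ on training data, and the outer problem minimizes a validation loss $\mathcal{L}_V(w^*(\lambda))$ over the hyperparameter $\lambda$. When $\mathcal{L}_V$ has no direct dependence on $\lambda$, the implicit function theorem gives the hypergradient $-\,\partial_w \mathcal{L}_V \, [\partial^2_{ww} \mathcal{L}_T]^{-1} \, \partial^2_{w\lambda} \mathcal{L}_T$. The ridge regression setting, the synthetic data, the function names, and the truncated Neumann series used to approximate the inverse Hessian are all assumptions chosen for illustration.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Inner problem: argmin_w 0.5*||Xw - y||^2 + 0.5*lam*||w||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def val_loss(w, X_val, y_val):
    """Outer objective: validation loss, depending on lam only through w."""
    r = X_val @ w - y_val
    return 0.5 * r @ r

def ift_hypergradient(X, y, X_val, y_val, lam, neumann_terms=None):
    """d(val_loss)/d(lam) via the implicit function theorem.

    If neumann_terms is None, the inverse-Hessian-vector product is exact;
    otherwise it is approximated by a truncated Neumann series (an
    illustrative choice of approximation, assumed here).
    """
    d = X.shape[1]
    w = ridge_solution(X, y, lam)
    H = X.T @ X + lam * np.eye(d)          # d^2 L_T / dw^2 (symmetric, PD for lam > 0)
    g_val = X_val.T @ (X_val @ w - y_val)  # dL_V / dw at w*(lam)
    mixed = w                              # d^2 L_T / (dw dlam) for the ridge penalty
    if neumann_terms is None:
        v = np.linalg.solve(H, g_val)
    else:
        # H^{-1} g ~= alpha * sum_{j=0}^{K} (I - alpha*H)^j g, valid when
        # the spectral radius of (I - alpha*H) is below 1.
        alpha = 1.0 / np.linalg.norm(H, 2)
        term = g_val.copy()
        acc = g_val.copy()
        for _ in range(neumann_terms):
            term = term - alpha * (H @ term)
            acc += term
        v = alpha * acc
    return -v @ mixed

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(50, 5))
y = X @ w_true + 0.1 * rng.normal(size=50)
X_val = rng.normal(size=(30, 5))
y_val = X_val @ w_true + 0.1 * rng.normal(size=30)
lam = 0.5

exact = ift_hypergradient(X, y, X_val, y_val, lam)
approx = ift_hypergradient(X, y, X_val, y_val, lam, neumann_terms=200)
print("exact IFT hypergradient:  ", exact)
print("Neumann-series estimate:  ", approx)
```

In this toy problem the inner solution is available in closed form, so the hypergradient can be checked against finite differences of the validation loss. In a deep learning setting the Hessian is never formed explicitly, and the products `H @ term` would instead be computed as Hessian-vector products by automatic differentiation.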
