Scalable Nested Optimization for Deep Learning

Jonathan Lorraine

TL;DR

This thesis addresses scalable nested optimization in deep learning, where bilevel objectives must be solved at the scale of modern networks. It develops four complementary threads: hypernetworks for amortized hyperparameter optimization, implicit-function-theorem-based hypergradients for millions of hyperparameters, complex momentum to stabilize optimization in differentiable games, and Generalized Ridge Rider to find diverse equilibria via Lyapunov exponents. Together, these methods support tuning millions of parameters and hyperparameters, accelerating hyperparameter search, stabilizing adversarial training, and discovering multiple solutions in multi-agent settings. The reported results cover hyperparameter tuning, GAN training, and diversification of equilibria. Loss-based ($\mathcal{L}$) formulations, Jacobians, and inverse-Hessian approximations underpin the theory and the practical efficiency of the proposed methods.

Abstract

Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this: bilevel or nested optimization, in which subsets of parameters are updated on different objectives nested inside each other. We focus on the motivating examples of hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when solving these nested problems at a large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups.
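For the hyperparameter optimization example, the nested structure can be written out explicitly. The display below is a sketch in the notation used elsewhere on this page (training loss $\mathcal{L}_{\text{T}}$, validation loss $\mathcal{L}_{\text{V}}$, hyperparameters $\boldsymbol{\lambda}$, weights $\textbf{w}$). It assumes the inner problem has a unique minimizer, which need not hold for deep networks.

$$
\boldsymbol{\lambda}^{*} = \arg\min_{\boldsymbol{\lambda}} \mathcal{L}_{\text{V}}\left(\boldsymbol{\lambda}, \textbf{w}^{*}(\boldsymbol{\lambda})\right), \quad \text{where} \quad \textbf{w}^{*}(\boldsymbol{\lambda}) = \arg\min_{\textbf{w}} \mathcal{L}_{\text{T}}\left(\boldsymbol{\lambda}, \textbf{w}\right)
$$

The outer problem only sees the weights through the best response $\textbf{w}^{*}(\boldsymbol{\lambda})$, which is why re-solving the inner problem from scratch for every candidate $\boldsymbol{\lambda}$ becomes expensive at scale.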

Paper Structure (126 sections, 13 theorems, 97 equations, 60 figures, 11 tables, 13 algorithms)

Introduction
Thesis Outline
Summary of Publications
Research Used in Thesis
Non-thesis Research
Hyperparameter Optimization Through Hypernetworks
Introduction
Training a network to output optimal weights
Advantages of hypernetwork-based optimization
Limitations of hypernetwork-based optimization
Jointly training parameters and hyperparameters
Related Work
Experiments
Learning a global best-response
Learning a local best-response
...and 111 more sections

Key Result

Theorem 1

If for some $\left(\boldsymbol{\lambda}', \textbf{w}'\right)$, $\left. \frac{\partial \mathcal{L}_{\text{T}}}{\partial \textbf{w}} \right|_{\boldsymbol{\lambda}', \textbf{w}'} = 0$ and regularity conditions are satisfied, then surrounding $\left(\boldsymbol{\lambda}', \textbf{w}'\right)$ there is a function $\textbf{w}^{
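To illustrate how an implicit-function-theorem hypergradient can be computed, here is a minimal Python sketch on a scalar toy problem. The toy losses, the constants `a` and `b`, and all function names are assumptions made for illustration and are not taken from the thesis. The sketch uses the standard implicit-function-theorem Jacobian, $\frac{\partial \textbf{w}^{*}}{\partial \boldsymbol{\lambda}} = -\left[\frac{\partial^{2} \mathcal{L}_{\text{T}}}{\partial \textbf{w} \partial \textbf{w}^{\top}}\right]^{-1} \frac{\partial^{2} \mathcal{L}_{\text{T}}}{\partial \textbf{w} \partial \boldsymbol{\lambda}^{\top}}$, and replaces the exact inverse Hessian with a truncated Neumann series, one common inverse-Hessian approximation.

```python
# Toy losses (assumed for illustration):
#   L_T(w, lam) = 0.5 * (w - a)**2 + 0.5 * lam * w**2   (training)
#   L_V(w)      = 0.5 * (w - b)**2                      (validation)
# The exact best response is w*(lam) = a / (1 + lam).

def hessian_ww(lam):
    """d^2 L_T / dw^2 for the toy training loss."""
    return 1.0 + lam

def mixed_w_lam(w):
    """d^2 L_T / (dw dlam) for the toy training loss."""
    return w

def val_grad_w(w, b):
    """dL_V / dw for the toy validation loss."""
    return w - b

def neumann_inverse(h, alpha, num_terms):
    """Approximate 1/h by alpha * sum_j (1 - alpha*h)**j.

    The series converges only when |1 - alpha*h| < 1.
    """
    total, term = 0.0, 1.0
    for _ in range(num_terms):
        total += term
        term *= 1.0 - alpha * h
    return alpha * total

def ift_hypergradient(lam, a, b, alpha=0.3, num_terms=50):
    """dL_V/dlam at the best response, via the IFT Jacobian."""
    w_star = a / (1.0 + lam)  # closed form here; trained weights in practice
    h_inv = neumann_inverse(hessian_ww(lam), alpha, num_terms)
    dw_dlam = -h_inv * mixed_w_lam(w_star)
    return val_grad_w(w_star, b) * dw_dlam

def finite_difference(lam, a, b, eps=1e-6):
    """Central-difference check on the best-response validation loss."""
    def best_response_val(l):
        w = a / (1.0 + l)
        return 0.5 * (w - b) ** 2
    return (best_response_val(lam + eps) - best_response_val(lam - eps)) / (2 * eps)
```

Calling `ift_hypergradient(1.0, 2.0, 0.5)` and `finite_difference(1.0, 2.0, 0.5)` should give closely matching values. In a deep network the scalar Hessian becomes a matrix that is never formed explicitly, so the Neumann terms would be built from Hessian-vector products instead.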

Figures (60)

Figure 1: Left: A typical computational graph for cross-validation, where $\alpha$ are the optimizer parameters, and $\boldsymbol{\lambda}$ are training loss hyperparameters. It is expensive to differentiate throughout the training procedure. Right: The proposed computational graph with our changes in red, where ${\boldsymbol{\phi}}$ are the hypernetwork parameters. We can differentiate cheaply through the hypernetwork to optimize the validation loss $\mathcal{L}_{\text{V}}$ with respect to hyperparameters $\boldsymbol{\lambda}$. We use $\textbf{x}$, $\textbf{t}$, and $\textbf{y}$ to refer to a data point, a label, and a prediction.
Figure 2: The validation loss of a neural net is estimated by cross-validation (crosses) or a hypernetwork (line), which outputs 7850-dimensional network weights. Cross-validation requires optimizing from scratch each time. The hypernetwork can be used as a proxy to cheaply evaluate the best-responding validation loss $\mathcal{L}_{\text{V}}^{*}$.
Figure 3: A visualization of exact (blue) and approximate (red) optimal weights as a function of hyperparameters. The approximately optimal weights $\textbf{w}_{\phi^{*}}$ are produced by a linear model fit at $\hat{\lambda}$. The true optimal hyperparameter is $\lambda^{*}$, while the hyperparameter minimizing the hypernetwork-approximated validation loss is $\lambda_{\phi^{*}}$.
Figure 4: Training and validation losses of a neural network are estimated by cross-validation (crosses) or a linear hypernetwork (lines). The hypernetwork's limited capacity makes it only accurate where the hyperparameter distribution puts mass.
Figure 5: Validation and test losses during hyperparameter optimization with a separate $\ell_{2}$ weight decay applied to each weight in the model. Thus, models with more parameters have more hyperparameters. Left: We solve the $7850$-dimensional hyperparameter optimization problem with a linear network and multiple algorithms. Hypernetwork-based optimization converges to a suboptimal solution faster than unrolled optimization from maclaurin2015gradient. Right: Hyper-training is applied to different layer configurations in the model.
...and 55 more figures
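The idea behind Figures 1 to 4, a hypernetwork that maps hyperparameters to approximately optimal weights, can be sketched on a scalar toy problem. Everything below is an illustrative assumption rather than the thesis's setup: the toy training loss, the constants `A` and `LAM_HAT`, the function names, and the use of a fixed grid of hyperparameters with full-batch gradient descent in place of sampling from a hyperparameter distribution.

```python
# Toy training loss (assumed for illustration):
#   L_T(w, lam) = 0.5 * (w - A)**2 + 0.5 * lam * w**2
# whose exact best response is w*(lam) = A / (1 + lam).
A = 2.0
LAM_HAT = 1.0  # hyperparameter value the linear hypernetwork is fit around

def train_grad_w(w, lam):
    """dL_T/dw for the toy training loss."""
    return (w - A) + lam * w

def hypernet(phi, lam):
    """Linear hypernetwork: w_phi(lam) = phi[0] + phi[1] * (lam - LAM_HAT)."""
    return phi[0] + phi[1] * (lam - LAM_HAT)

def fit_hypernet(lams, lr=0.2, steps=1000):
    """Minimize the average of L_T(w_phi(lam), lam) over lams w.r.t. phi."""
    phi = [0.0, 0.0]
    for _ in range(steps):
        g0 = g1 = 0.0
        for lam in lams:
            # Chain rule: dL_T/dphi = dL_T/dw * dw_phi/dphi.
            g_w = train_grad_w(hypernet(phi, lam), lam)
            g0 += g_w
            g1 += g_w * (lam - LAM_HAT)
        phi[0] -= lr * g0 / len(lams)
        phi[1] -= lr * g1 / len(lams)
    return phi

# Hyperparameters on a grid around LAM_HAT stand in for samples.
lams = [LAM_HAT + 0.1 * k for k in range(-5, 6)]
phi = fit_hypernet(lams)
exact_at_lam_hat = A / (1.0 + LAM_HAT)
```

After fitting, `hypernet(phi, LAM_HAT)` should land close to `exact_at_lam_hat`, and `phi[1]` should approximate the slope of the exact best response at `LAM_HAT`. Because the model is linear, the approximation degrades for hyperparameters far from the grid, which mirrors the limited-capacity behavior described in the Figure 4 caption.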

Theorems & Definitions (20)

Theorem 1: Cauchy, Implicit Function Theorem
Lemma
Theorem 2: Neumann-SGD
Theorem 3: Consequence of Prop. 4.4.1 bertsekas2008nonlinear
Corollary 1: Convergence of Complex Momentum
Theorem: Augustin-Louis Cauchy, Implicit Function Theorem
Lemma 1
Proof
Lemma 2
Proof
...and 10 more
