Stochastic Hyperparameter Optimization through Hypernetworks

Jonathan Lorraine, David Duvenaud

TL;DR

The paper tackles the cost of hyperparameter tuning by replacing nested training loops with a differentiable hypernetwork that maps hyperparameters to near-optimal weights, enabling SGD-based optimization of hyperparameters through the validation loss. It provides both global and local training schemes, with theoretical convergence under mild conditions and practical joint optimization that can handle thousands of hyperparameters. Empirical results show faster convergence and better scalability than unrolled optimization and Gaussian-process-based methods, and demonstrate the approach scales to deeper networks via linear/hybrid hypernetworks. These findings offer a scalable, differentiable alternative to traditional hyperparameter methods and point to avenues for integration with meta-learning and multi-step optimization.

Abstract

Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters. We give a method to collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our process trains a neural network to output approximately optimal weights as a function of hyperparameters. We show that our technique converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.
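The joint optimization described above can be illustrated on a toy problem. What follows is a minimal sketch in plain Python, not the authors' implementation. It assumes a one-weight linear model, a single hyperparameter (the log of an L2 penalty), a linear hypernetwork, and hand-derived gradients. The data, the learning rates, the Gaussian sampling width `sigma`, the warm-up phase, and all function names are illustrative assumptions.

```python
import math
import random

# Toy data (hypothetical): 1-D linear model y = w * x.
X_TRAIN, T_TRAIN = [1.0, 2.0, 3.0], [2.5, 4.5, 7.0]
X_VALID, T_VALID = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]


def train_grad_w(w, lam):
    """d/dw of: training mean squared error + exp(lam) * w**2."""
    n = len(X_TRAIN)
    data_term = sum(2.0 * (w * x - t) * x for x, t in zip(X_TRAIN, T_TRAIN)) / n
    return data_term + 2.0 * math.exp(lam) * w


def valid_loss(w):
    """Validation mean squared error (no penalty term)."""
    m = len(X_VALID)
    return sum((w * x - t) ** 2 for x, t in zip(X_VALID, T_VALID)) / m


def valid_grad_w(w):
    """d/dw of the validation mean squared error."""
    m = len(X_VALID)
    return sum(2.0 * (w * x - t) * x for x, t in zip(X_VALID, T_VALID)) / m


def hyper_train(steps=6000, warmup=1500, lr_phi=0.005, lr_lam=0.05,
                sigma=0.3, lam_init=1.0, seed=0):
    rng = random.Random(seed)
    a, b = 0.0, 0.0        # linear hypernetwork: w_phi(lam) = a + b * lam
    lam_hat = lam_init     # current hyperparameter estimate
    for step in range(steps):
        # 1) Hypernetwork step: sample a hyperparameter near lam_hat and
        #    descend the training loss with respect to (a, b).
        lam = lam_hat + sigma * rng.gauss(0.0, 1.0)
        g_w = train_grad_w(a + b * lam, lam)
        a -= lr_phi * g_w          # chain rule: dw/da = 1
        b -= lr_phi * g_w * lam    # chain rule: dw/db = lam
        # 2) Hyperparameter step: descend the validation loss through the
        #    hypernetwork. The warm-up delay is an assumption of this sketch.
        if step >= warmup:
            lam_hat -= lr_lam * valid_grad_w(a + b * lam_hat) * b  # dw/dlam = b
    return a, b, lam_hat


a, b, lam_hat = hyper_train()
w_final = a + b * lam_hat
print("lam_hat:", lam_hat)
print("hypernetwork weight at lam_hat:", w_final)
print("validation loss:", valid_loss(w_final))
```

For this toy data the exact best response is $\mathrm{w}^{*}(\lambda) = 32.5 / (14 + 3e^{\lambda})$, and the validation loss is minimized at $\mathrm{w} = 2$, so the validation-optimal hyperparameter is $\lambda^{*} = \ln 0.75 \approx -0.29$. The sketch should settle near that value, with a small offset because a linear hypernetwork only approximates the best response around the current estimate.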

Paper Structure

This paper contains 18 sections, 1 theorem, 4 equations, 8 figures, 4 algorithms.

Key Result

Theorem 2.1

Sufficiently powerful hypernetworks can learn continuous best-response functions, which minimize the expected loss for all hyperparameter distributions with convex support.
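
For orientation, the objects named in the theorem can be written out as follows. This is a sketch that follows the notation of the figure captions ($\lambda$ for hyperparameters, $\phi$ for hypernetwork parameters, $\mathrm{w}_{\phi}$ for the hypernetwork output). The symbols $\mathrm{w}^{*}$, $p(\lambda)$, and $\mathcal{L}_{\mathrm{Train}}$ are assumed here and are not quoted from the theorem statement.

$$
\begin{aligned}
\mathrm{w}^{*}(\lambda) &= \operatorname*{argmin}_{\mathrm{w}} \; \mathcal{L}_{\mathrm{Train}}(\mathrm{w}, \lambda) && \text{(best-response function)} \\
\phi^{*} &= \operatorname*{argmin}_{\phi} \; \mathbb{E}_{\lambda \sim p(\lambda)} \left[ \mathcal{L}_{\mathrm{Train}}(\mathrm{w}_{\phi}(\lambda), \lambda) \right] && \text{(expected-loss objective)}
\end{aligned}
$$

Read this way, the theorem says that a sufficiently expressive hypernetwork trained on the second objective can match the first, $\mathrm{w}_{\phi^{*}}(\lambda) = \mathrm{w}^{*}(\lambda)$, for hyperparameters in the support of $p(\lambda)$.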

Figures (8)

  • Figure 1: Left: A typical computational graph for cross-validation, where $\alpha$ are the optimizer parameters, and $\lambda$ are training loss hyperparameters. It is expensive to differentiate through the entire training procedure. Right: The proposed computational graph with our changes in red, where $\phi$ are the hypernetwork parameters. We can cheaply differentiate through the hypernetwork to optimize the validation loss $\mathop{\mathcal{L}}_{\mathrm{Valid.}}$ with respect to hyperparameters $\lambda$. We use $x$, $t$, and $y$ to refer to a data point, its label, and a prediction respectively.
  • Figure 2: The validation loss of a neural net, estimated by cross-validation (crosses) or by a hypernetwork (line), which outputs $7,850$-dimensional network weights. Cross-validation requires optimizing from scratch each time. The hypernetwork can be used to evaluate the validation loss cheaply.
  • Figure 3: A visualization of exact (blue) and approximate (red) optimal weights as a function of hyperparameters. The approximately optimal weights $\mathrm{w}_{\phi^{*}}$ are output by a linear model fit at ${\hat{\lambda}}$. The true optimal hyperparameter is $\lambda^{*}$, while the hyperparameter estimated using approximately optimal weights is nearby at $\lambda_{\phi^{*}}$.
  • Figure 4: Training and validation losses of a neural network, estimated by cross-validation (crosses) or a linear hypernetwork (lines). The hypernetwork's limited capacity makes it only accurate where the hyperparameter distribution puts mass.
  • Figure 5: Validation and test losses during hyperparameter optimization with a separate $L_{2}$ weight decay applied to each weight in the model. Thus, models with more parameters have more hyperparameters. Top: We solve the $7,850$-dimensional hyperparameter optimization problem with a linear model and multiple algorithms. Hypernetwork-based optimization converges to a sub-optimal solution faster than unrolled optimization from Maclaurin et al. (2015). Bottom: Hyper-training is applied to different layer configurations in the model.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Theorem 2.1
  • proof