Stochastic Hyperparameter Optimization through Hypernetworks
Jonathan Lorraine, David Duvenaud
TL;DR
The paper reduces the cost of hyperparameter tuning by replacing nested training loops with a differentiable hypernetwork that maps hyperparameters to approximately optimal weights, so that hyperparameters can be optimized by SGD on the validation loss. It provides both global and local training schemes, a convergence guarantee to locally optimal weights and hyperparameters for sufficiently large hypernetworks, and a practical joint optimization procedure that can handle thousands of hyperparameters. Empirical results show faster convergence and better scalability than unrolled optimization and Gaussian-process-based methods, and show that the approach scales to deeper networks via linear/hybrid hypernetworks. These findings offer a scalable, differentiable alternative to traditional hyperparameter optimization methods and point to possible integration with meta-learning and multi-step optimization.
Abstract
Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters. We give a method to collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our process trains a neural network to output approximately optimal weights as a function of hyperparameters. We show that our technique converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.
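The joint optimization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the synthetic ridge-regression task, the linear hypernetwork w(h) = W0 + W1 * h with h = log(lambda), the Gaussian perturbation of h, the step sizes, and the clipping range are all assumptions chosen to keep the example short and dependent on NumPy only. Each iteration alternates one hypernetwork update on the regularized training loss, evaluated at a hyperparameter sampled near the current one, with one hyperparameter update on the validation loss through the hypernetwork.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task (illustrative data, not from the paper).
d, n_train, n_val = 10, 40, 100
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true + 2.0 * rng.normal(size=n_train)
X_val = rng.normal(size=(n_val, d))
y_val = X_val @ w_true + 2.0 * rng.normal(size=n_val)


def mse_grad(X, y, w):
    """Gradient of the mean squared error with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def val_loss(w):
    """Unregularized mean squared error on the validation set."""
    return float(np.mean((X_val @ w - y_val) ** 2))


# Linear hypernetwork (an assumption): w(h) = W0 + W1 * h, h = log(lambda).
W0, W1 = np.zeros(d), np.zeros(d)
h = 0.0
lr_w, lr_h, sigma = 0.005, 0.01, 0.3  # hypothetical step sizes and noise scale
baseline_val = val_loss(np.zeros(d))  # validation loss of all-zero weights

for step in range(3000):
    # 1) Hypernetwork step: sample a hyperparameter near the current one and
    #    descend the training loss MSE(w) + exp(h_s) * ||w||^2 at w = w(h_s).
    h_s = h + sigma * rng.normal()
    w = W0 + W1 * h_s
    g = mse_grad(X_train, y_train, w) + 2.0 * np.exp(h_s) * w
    W0 -= lr_w * g          # dw/dW0 is the identity
    W1 -= lr_w * g * h_s    # dw/dW1 is h_s times the identity

    # 2) Hyperparameter step: descend the validation loss through the
    #    hypernetwork, using dw/dh = W1. Clipping keeps the sketch stable.
    g_val = mse_grad(X_val, y_val, W0 + W1 * h)
    h = float(np.clip(h - lr_h * (g_val @ W1), -4.0, 1.0))

final_val = val_loss(W0 + W1 * h)
print(f"log(lambda) = {h:.3f}")
print(f"validation MSE: zero weights {baseline_val:.3f}, hypernetwork {final_val:.3f}")
```

No inner training loop is run to convergence for any single hyperparameter value: the hypernetwork weights and the hyperparameter are updated side by side. With a single scalar hyperparameter the benefit over grid search is small; the sketch is meant only to show the structure of the alternating updates.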
