Hyperparameter Loss Surfaces Are Simple Near their Optima

Nicholas Lourie, He He, Kyunghyun Cho

TL;DR

We derive a new asymptotic law for random search that explains and extrapolates its convergence and enables new analyses, such as computing confidence intervals for the best possible performance or determining the effective number of hyperparameters.

Abstract

Hyperparameters greatly impact models' capabilities; however, modern models are too large for extensive search. Instead, researchers design recipes that train well across scales based on their understanding of the hyperparameters. Despite this importance, few tools exist for understanding the hyperparameter loss surface. We discover novel structure in it and propose a new theory yielding such tools. The loss surface is complex, but as you approach the optimum, simple structure emerges. It becomes characterized by a few basic features, like its effective dimension and the best possible loss. To uncover this asymptotic regime, we develop a novel technique based on random search. Within this regime, the best scores from random search take on a new distribution we discover. Its parameters are exactly the features defining the loss surface in the asymptotic regime. From these features, we derive a new asymptotic law for random search that can explain and extrapolate its convergence. These new tools enable new analyses, such as computing confidence intervals for the best possible performance or determining the effective number of hyperparameters. We make these tools available at https://github.com/nicholaslourie/opda.
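To make "the best scores from random search" concrete, the following is a minimal sketch of an empirical tuning curve: the expected best loss after k iterations of random search, estimated from a sample of completed runs. It is a plain plug-in estimator (resampling with replacement from the observed losses), not the asymptotic law or the noisy quadratic distribution from the paper, and it does not use the opda library's API. The names `expected_best_loss` and `toy_loss`, and the toy loss surface itself, are hypothetical and exist only for illustration.

```python
import random


def expected_best_loss(losses, k):
    """Expected minimum of k draws, with replacement, from the observed losses."""
    ys = sorted(losses)
    n = len(ys)
    total = 0.0
    for i, y in enumerate(ys):  # y is the (i + 1)-th smallest loss
        # P(min of k draws is y) = P(all draws >= y) - P(all draws > y)
        p_at_least = ((n - i) / n) ** k
        p_above = ((n - i - 1) / n) ** k
        total += y * (p_at_least - p_above)
    return total


def toy_loss(x, rng):
    # Hypothetical loss surface (an assumption, not from the paper):
    # a quadratic bowl with best loss 1.0 at x = (0.3, 0.3), plus small noise.
    return 1.0 + sum((xi - 0.3) ** 2 for xi in x) + rng.gauss(0.0, 0.01)


# Simulate 1,024 iterations of random search over the unit square.
rng = random.Random(0)
losses = [toy_loss([rng.random(), rng.random()], rng) for _ in range(1024)]

# Tuning curve: expected best loss as a function of the search budget k.
budgets = (1, 2, 4, 8, 16, 32)
curve = [expected_best_loss(losses, k) for k in budgets]
for k, value in zip(budgets, curve):
    print(f"k={k:>2}  expected best loss = {value:.4f}")
```

The curve is non-increasing in k and bounded below by the best observed loss, so on its own it cannot extrapolate beyond the sample; extrapolation is the gap that the abstract's asymptotic law is meant to fill.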

Paper Structure

This paper contains 24 sections, 4 theorems, 46 equations, 8 figures.

Key Result

Proposition F.1

Let $\mathbb{X} \subset \mathbb{R}^d$ be compact, $\mathbb{Y} \subset \mathbb{R}$, $g: \mathbb{X} \to \mathbb{Y}$ continuous, and $y_* = g(\pmb{x}_*)$ its unique minimum. Then $\forall \delta > 0$, $\exists \epsilon > 0$ such that $|g(\pmb{x}) - y_*| < \epsilon$ implies $\|\pmb{x} - \pmb{x}_*\| < \delta$.
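As an illustration only (this one-dimensional instance is an assumption chosen for simplicity, not an example from the paper), the statement can be checked by hand for a quadratic on an interval, where $\epsilon = \delta^2$ works:

```latex
\mathbb{X} = [-1, 1], \qquad g(x) = x^2, \qquad x_* = 0, \qquad y_* = 0.

\text{Given } \delta > 0, \text{ choose } \epsilon = \delta^2 : \qquad
|g(x) - y_*| = x^2 < \delta^2 \;\implies\; |x - x_*| = |x| < \delta.
```

In words: a score close enough to the best possible score forces the hyperparameters to be close to the optimum, which is why the region of better hyperparameters shrinks around the optimum as the search improves.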

Figures (8)

  • Figure 1: The hyperparameter loss surface has simple structure near the optimum. Using this structure, we can reason about how the validation score will improve as we run an algorithm like random search. The plots compare the theoretical functional form against the empirical rate of progress using 1,024 training runs in each. The ground truth (dashed blue) closely adheres to the theoretical form (solid yellow), with that form remaining fully inside its 95% confidence bands. In all three scenarios, namely language model pretraining (log loss), supervised finetuning (error rate), and image classification (error rate), the simple structure near the optimum drives the practical outcomes of hyperparameter search after just 1 or 2 iterations.
  • Figure 2: With more iterations, random search finds better hyperparameters. As the best score improves, the region of better hyperparameters shrinks around the optimum; as a result, the Taylor polynomial gives better approximations at the hyperparameters that improve the score.
  • Figure 3: A comparison of the score distribution (empirical) and noisy quadratic (theoretical). The top row depicts CDFs, the bottom row depicts tuning curves. Each column corresponds to a different scenario: pretraining Llama 33M on SlimPajama-6B (cross-entropy), finetuning DeBERTaV3 on MultiNLI (error rate), and training ResNet18 on ImageNet (error rate). The asymptotic regime is the performance threshold beyond which the theoretical approximations apply. All estimates use the scenarios' full 1,024 iterations of random search. In the asymptotic regime, the noisy quadratic distribution matches the score distribution from random search.
  • Figure 4: Diagnostics for the score distribution's normality given fixed hyperparameters. The top row shows histograms with kernel density estimates; the bottom shows Q-Q plots. Columns represent configurations across error rate percentiles for ResNet18 on ImageNet. All except the worst performing hyperparameters demonstrate a very high degree of normality.
  • Figure 5: A comparison of standard deviations for the score given fixed hyperparameters. The $x$-axis gives configurations at different error rate percentiles for ResNet18 on ImageNet. Points are standard deviations at those percentiles. Confidence intervals are simultaneous. The standard deviation quickly converges to a constant long before the asymptotic regime.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Proposition F.1
  • proof
  • Theorem F.2
  • proof
  • Proposition F.3
  • proof
  • Proposition F.4
  • proof