Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers

Robin M. Schmidt, Frank Schneider, Philipp Hennig

TL;DR

This work addresses the challenge of selecting and tuning optimizers for deep learning by conducting a large-scale benchmark of 15 optimizers across 8 DeepOBS problems under 4 tuning budgets and 4 learning-rate schedules, totaling over $53\,760$ training runs. It shows that optimizer performance is highly task-dependent and that evaluating multiple optimizers with default settings can be as effective as tuning a single one, while tuning and schedules offer modest but variable gains. No single method dominates across all tasks, though Adam and its variants frequently perform well; some problems favor alternatives such as NAG or RMSProp, highlighting task-specific dynamics. By releasing an open, extensible dataset, the authors provide a practical baseline resource for future optimizer development and meta-learning, encouraging research toward robust inner-loop tuning and problem-aware optimization strategies.
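
To give a feel for the scale of this grid, the following minimal sketch enumerates the combinations of the four factors named above. Only the counts (15, 8, 4, 4) come from the summary; the label strings are placeholders and not the names used by the authors or by DeepOBS.

```python
from itertools import product

# Counts are taken from the summary above; the labels are hypothetical placeholders.
optimizers = [f"optimizer_{i}" for i in range(1, 16)]  # 15 optimizers
problems = [f"problem_{i}" for i in range(1, 9)]       # 8 DeepOBS problems
budgets = [f"budget_{i}" for i in range(1, 5)]         # 4 tuning budgets
schedules = [f"schedule_{i}" for i in range(1, 5)]     # 4 learning-rate schedules

# Every (optimizer, problem, budget, schedule) combination is one configuration.
configurations = list(product(optimizers, problems, budgets, schedules))
print(len(configurations))  # 15 * 8 * 4 * 4 = 1920
```

Each configuration involves several training runs (tuning trials and repeated random seeds), which is why the total number of runs is far larger than the 1920 configurations.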

Abstract

Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than $50{,}000$ individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.

Paper Structure

This paper contains 30 sections, 1 equation, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Number of times ArXiv titles and abstracts mention a specific optimizer per year. All non-selected optimizers from \ref{tab:Optimizers} in the appendix are grouped into Other. This figure illustrates not only the expected increase in both methods and mentions, but also that our selection covers the most popular methods. In $2020$, the excluded methods accounted for $<4\,\%$ of the mentions (see \ref{fig:arxiv_normalized}).
  • Figure 2: The test set performance improvement after switching from any untuned optimizer ($y$-axis, one-shot) to any tuned optimizer ($x$-axis, small budget) as an average over $10$ random seeds for the constant schedule. For example, the bottom left cell of the largest matrix indicates that the tuned version of AMSBound (1) reaches a $2.4\,\%$ higher test accuracy than untuned SGD (15). We discuss the unintuitive occurrence of negative diagonal entries in \ref{sec:heatmaps_appendix}. The colormap is capped at $\pm 3$ to improve presentation, although larger values occur.
  • Figure 3: Lines in gray (---, smoothed by cubic splines for visual guidance only) show the relative improvement for a certain tuning budget and schedule (compared to the one-shot tuning without schedule) for all fifteen optimizers on all eight problems. The median over all lines is plotted in orange (---) with the shaded area (❚) indicating the area between the 25th and 75th percentile. With an increased budget and a schedule, one can expect a performance increase on average (orange lines), but not automatically for individual settings (i.e. gray lines can be unaffected or even decrease).
  • Figure 4: Mean test set performance over $10$ random seeds of all tested optimizers on all eight optimization problems using the large budget for tuning and no learning rate schedule. One standard deviation for the tuned Adam optimizer is shown with a red error bar (I; error bars for other methods omitted for legibility). The performance of untuned Adam (▼) and AdaBound (▲) is marked for reference. The upper bound of each axis represents the best performance achieved in the benchmark, while the lower bound is chosen in relation to the performance of Adam with default parameters. Tabular version available in the Appendix as \ref{tab:app_tabular_large_none}.
  • Figure 5: Performance of SGD on a simple multilayer perceptron. For each learning rate, markers in orange (✖) show the initial seed which would be used for tuning, blue markers (✖) illustrate nine additional seeds with otherwise unchanged settings. The mean over all seeds is plotted as a blue line (---), showing one standard deviation as a shaded area (❚).
  • ...and 12 more figures
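
The pairwise comparison described in the caption of Figure 2 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the improvement is the absolute difference between seed-averaged test accuracies (tuned minus untuned), and the optimizer names and accuracy values are made up for the example.

```python
from statistics import mean

def improvement_matrix(untuned, tuned):
    """Return (names, matrix), where matrix[r][c] is the gain from switching
    from untuned optimizer names[r] (row) to tuned optimizer names[c] (column).

    Assumption: the gain is the absolute difference of seed-averaged accuracies.
    """
    names = list(untuned)
    untuned_mean = {name: mean(untuned[name]) for name in names}
    tuned_mean = {name: mean(tuned[name]) for name in names}
    matrix = [
        [tuned_mean[col] - untuned_mean[row] for col in names]
        for row in names
    ]
    return names, matrix

# Hypothetical test accuracies in percent, one value per random seed.
untuned = {"A": [90.0, 92.0], "B": [88.0, 90.0]}
tuned = {"A": [92.0, 94.0], "B": [91.0, 91.0]}

names, matrix = improvement_matrix(untuned, tuned)
for name, row in zip(names, matrix):
    print(name, row)
```

In this layout, a diagonal entry compares an optimizer's tuned version against its own untuned version, and an off-diagonal entry compares two different optimizers.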