Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers
Robin M. Schmidt, Frank Schneider, Philipp Hennig
TL;DR
This work addresses the challenge of selecting and tuning optimizers for deep learning with a large-scale benchmark of 15 optimizers across 8 DeepOBS problems under 4 tuning budgets and 4 learning-rate schedules, totaling over $53\,760$ training runs. The results show that optimizer performance is highly task-dependent and that evaluating multiple optimizers with default settings can be as effective as tuning a single one, while tuning and schedules offer modest but variable gains. No single method dominates across all tasks. Adam and its variants frequently perform well, but some problems favor alternatives such as NAG or RMSProp, highlighting task-specific dynamics. By releasing an open, extensible dataset, the authors provide a practical baseline resource for future optimizer development and meta-learning, encouraging research toward robust inner-loop tuning and problem-aware optimization strategies.
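To make the size of the benchmark grid concrete, the following minimal sketch enumerates the combinations described above. The labels are hypothetical placeholders, since only the counts (15, 8, 4, 4) are stated here, not the full lists of optimizers, problems, budgets, or schedules.

```python
from itertools import product

# Placeholder labels (assumptions): the summary gives only the counts.
optimizers = [f"optimizer_{i}" for i in range(15)]
problems = [f"problem_{i}" for i in range(8)]
budgets = [f"budget_{i}" for i in range(4)]
schedules = [f"schedule_{i}" for i in range(4)]

# One entry per (optimizer, problem, tuning budget, schedule) combination.
grid = list(product(optimizers, problems, budgets, schedules))
n_configurations = len(grid)

print(n_configurations)  # 15 * 8 * 4 * 4 = 1920
```

The grid therefore contains 1920 distinct configurations. The total number of training runs quoted above is far larger, presumably because each configuration involves more than one training run under its tuning budget. The exact breakdown is not given in this summary.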
Abstract
Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than $50{,}000$ individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.
