Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition

Priya Kasimbeg, Frank Schneider, Runa Eschenhagen, Juhan Bae, Chandramouli Shama Sastry, Mark Saroufim, Boyuan Feng, Less Wright, Edward Z. Yang, Zachary Nado, Sourabh Medapati, Philipp Hennig, Michael Rabbat, George E. Dahl

TL;DR

The paper analyzes the inaugural AlgoPerf: Training Algorithms benchmark, which evaluates speed-ups in neural network training arising purely from algorithmic improvements under fixed hardware and diverse workloads. It introduces two tuning rulesets (external tuning and self-tuning) and uses performance profiles to quantify time-to-target across eight base workloads and held-out variants, reporting substantial gains from non-diagonal preconditioning (Distributed Shampoo) and hyperparameter-free training (Schedule Free AdamW). The results demonstrate meaningful progress (approximately $28\%$ wall-clock speed-ups for Shampoo and $8\%$ for Schedule Free AdamW) but also reveal robustness challenges across workloads and the ongoing importance of fair benchmarking engineering. The study highlights both the practical impact of algorithmic advances and the need for principled benchmarking, including explicit hyperparameter specifications and protocol-sensitive tuning, to guide future training algorithm design.

Abstract

The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.

Paper Structure

This paper contains 19 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: AlgoPerf competition leaderboard & performance profiles for all submissions to the external (top) and self-tuning (bottom) ruleset. The leaderboards (\ref{tab:leaderboard_et}, \ref{tab:leaderboard_st}) are ranked by the submissions' benchmark scores, rounded to four significant digits. Higher scores indicate faster training. Note, scores are not comparable between rulesets. In the performance profiles (\ref{fig:pp_external_tuning}, \ref{fig:pp_self_tuning}), each line represents a submission. A step at $\tau$ indicates that, for one workload, this submission reaches the target within $\tau$ times the runtime of the fastest submission for that workload and ruleset.
  • Figure 2: Validation accuracy vs. runtime on the ResNet workload. The Baseline (\ref{fig:resnet_baseline}), NadamP (\ref{fig:resnet_nadamp}), and PyTorch Distributed Shampoo (\ref{fig:resnet_shampoo}) all reach the validation target on the ResNet workload for at least one study but not reliably enough to get a finite score. Shown are the best trials from each of the five studies, where "best" is either the fastest trial to achieve the target performance or the trial whose best performance is closest to the target. Trials that reach the target are marked with a solid line, while studies that do not reach the target are indicated with a dashed line. The gray dashed horizontal and vertical lines indicate the target performance and runtime budget respectively. Additionally, both Amos and Cyclic LR came close but missed the target in all studies.
  • Figure 3: Benchmark score as a function of $\tau_{\text{max}}$. The upper limit of the performance profile and upper integration limit for the benchmark score, $\tau_{\text{max}}$, determines which workload scores are treated as finite and influences the penalty for infinite scores. We observe that rankings remain stable for most submissions across different values of $\tau_{\text{max}}$.
  • Figure 4: Performance profiles of all AlgoPerf submissions when ignoring held-out workloads. Structurally the same as \ref{fig:perf_profiles} but here we ignore all benchmark rules involving the held-out workloads.
  • Figure 5: Performance profiles of all AlgoPerf submissions on the qualification workloads. Structurally the same as \ref{fig:perf_profiles} and \ref{fig:perf_profiles_ignore_heldouts} but only considering the three workloads that are part of the qualification set, i.e. Criteo 1TB, WMT, and OGBG. In the qualification set, no held-out workloads are used.
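
The captions above describe three quantities: the time a trial needs to reach its target within a runtime budget, the performance profile (the fraction of workloads on which a submission is within a factor $\tau$ of the fastest submission), and a benchmark score obtained by integrating that profile up to $\tau_{\text{max}}$. The following is a minimal sketch of how these could be computed, not the competition's scoring code. The function names, the toy timings, the value `tau_max=4.0`, the division by $\tau_{\text{max}} - 1$ so that scores fall in $[0, 1]$, and the higher-is-better metric are all assumptions made for illustration.

```python
import math


def time_to_target(eval_log, target, budget):
    """Return the first elapsed time at which the metric reaches the target.

    eval_log is a list of (elapsed_seconds, metric) pairs (hypothetical format).
    Assumes a higher-is-better metric. Returns math.inf if the target is not
    reached within the runtime budget.
    """
    for elapsed, metric in sorted(eval_log):
        if elapsed > budget:
            break
        if metric >= target:
            return elapsed
    return math.inf


def performance_ratios(times):
    """Map {submission: {workload: time}} to ratios against the fastest submission."""
    workloads = sorted({w for per_sub in times.values() for w in per_sub})
    fastest = {
        w: min(per_sub.get(w, math.inf) for per_sub in times.values())
        for w in workloads
    }
    ratios = {}
    for sub, per_sub in times.items():
        ratios[sub] = {}
        for w in workloads:
            t = per_sub.get(w, math.inf)
            # A workload that was not solved keeps an infinite ratio.
            ratios[sub][w] = t / fastest[w] if math.isfinite(t) else math.inf
    return ratios


def profile(sub_ratios, tau):
    """Fraction of workloads on which the ratio is at most tau."""
    return sum(r <= tau for r in sub_ratios.values()) / len(sub_ratios)


def benchmark_score(sub_ratios, tau_max):
    """Integrate the step-shaped profile from 1 to tau_max, scaled to [0, 1].

    Each workload with ratio r <= tau_max contributes (tau_max - r) to the area;
    workloads with larger or infinite ratios contribute nothing.
    """
    n = len(sub_ratios)
    area = sum(tau_max - r for r in sub_ratios.values() if r <= tau_max)
    return area / (n * (tau_max - 1.0))


# Toy time-to-target values in seconds (made up, not competition data).
times = {
    "A": {"w1": 100.0, "w2": 300.0, "w3": math.inf},
    "B": {"w1": 200.0, "w2": 150.0, "w3": 400.0},
}
ratios = performance_ratios(times)
scores = {sub: benchmark_score(r, tau_max=4.0) for sub, r in ratios.items()}
```

In this toy example, submission A is fastest on `w1` but never reaches the target on `w3`, so its profile can never rise above 2/3 no matter how large $\tau$ becomes. This is the mechanism by which $\tau_{\text{max}}$ sets the penalty for infinite scores: raising it increases the area that an unsolved workload forfeits.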