MLPerf Training Benchmark

Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, Matei Zaharia

TL;DR

The MLPerf Training Benchmark defines an end-to-end, multi-workload benchmark for fairly evaluating deep learning (DL) training across diverse hardware and software. It combines seven representative workloads with carefully chosen time-to-train metrics, quality thresholds, and reference implementations, so that performance is judged together with learning outcomes. The framework introduces open and closed divisions, three system categories, and a reproducible submission process, and it encourages ongoing community updates. Across two rounds (v0.5 and v0.6), MLPerf results show improvements in performance and scalability, supporting its role in driving real-world ML training optimization.

Abstract

Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time to solution exhibits high variance, and software and hardware systems are so diverse that fair benchmarking with the same binary, code, and even hyperparameters is difficult. We therefore present MLPerf, an ML benchmark that overcomes these challenges. Our analysis quantitatively evaluates MLPerf's efficacy at driving performance and scalability improvements across two rounds of results from multiple vendors.

Paper Structure

This paper contains 36 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Training epochs to reach the target quality for the MLPerf v0.5 NCF (a) and MiniGo (b) benchmarks. Each experiment uses identical hyperparameters except for the random seed. For MiniGo, we observed considerable variability across runs even when fixing the random seed (same color).
  • Figure 2: Top-1 accuracy of MLPerf v0.5 ResNet-50 benchmark over 100 epochs for five runs (denoted by color) with identical hyperparameters but different random seeds. The dashed line indicates the quality target of 74.9% Top-1 accuracy. The early training phase exhibits much more variability than later phases.
  • Figure 3: Speedup in the fastest 16-chip entry from MLPerf version v0.5 to v0.6 for various benchmarks common to both (Figure 3a), along with quality-target increases (Figure 3b).
  • Figure 4: Number of chips necessary to produce the fastest time to solution in MLPerf v0.5 and v0.6. From v0.5 to v0.6, this number increased by as much as $5.5\times$.
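
The captions for Figures 1 and 2 describe runs that share hyperparameters but differ in random seed, each measured by how long it takes to reach a quality target. A minimal sketch of that measurement follows. The accuracy curves are made-up illustrative values, not results from the paper, and the names `epochs_to_target` and `runs` are hypothetical; only the 74.9% Top-1 target comes from the Figure 2 caption.

```python
from statistics import mean
from typing import Optional, Sequence

# ResNet-50 Top-1 quality target quoted in the Figure 2 caption.
QUALITY_TARGET = 0.749


def epochs_to_target(accuracy_per_epoch: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based epoch at which accuracy first reaches `target`.

    Returns None if the run never reaches the target.
    """
    for epoch, accuracy in enumerate(accuracy_per_epoch, start=1):
        if accuracy >= target:
            return epoch
    return None


# Hypothetical per-epoch accuracy curves, one per random seed.
# These numbers are invented for illustration only.
runs = {
    "seed_0": [0.41, 0.62, 0.71, 0.744, 0.751, 0.756],
    "seed_1": [0.35, 0.60, 0.70, 0.750, 0.753, 0.757],
    "seed_2": [0.44, 0.58, 0.69, 0.739, 0.747, 0.752],
}

results = {name: epochs_to_target(curve, QUALITY_TARGET) for name, curve in runs.items()}
reached = [epochs for epochs in results.values() if epochs is not None]

# Summarize run-to-run variation caused only by the seed.
print("epochs to target per seed:", results)
print("mean:", mean(reached), "min:", min(reached), "max:", max(reached))
```

Even in this toy data the seed alone moves the result between 4 and 6 epochs, which is the kind of run-to-run variance a time-to-train metric has to account for.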