Breaking MLPerf Training: A Case Study on Optimizing BERT

Yongdeok Kim; Jaehyung Ahn; Myeongwoo Kim; Changin Choi; Heejae Kim; Narankhuu Tuvshinjargal; Seungwon Lee; Yanzi Zhang; Yuan Pei; Xiongzhan Linghu; Jingkun Ma; Lin Chen; Yuehua Dai; Sungjoo Yoo

TL;DR

This work tackles the efficiency bottlenecks of large-scale distributed BERT training by addressing load balancing and computation/communication overlap. It introduces local presorting via dataset stratification to balance work with minimal inter-node communication and bucket-wise gradient clipping before allreduce to enable bucket-level overlap during backpropagation. Through hyperparameter optimization of ADAM and a 1,024-GPU deployment, the approach achieves state-of-the-art MLPerf BERT times, outperforming prior submissions. The methods are practical for large-scale NLP pretraining and generalize to other transformer-based training pipelines with similar data distributions.

Abstract

Speeding up large-scale distributed training is challenging because it requires improving several components of training, including load balancing, communication, and optimizers. We present novel approaches for fast large-scale training of the BERT model that individually improve each component, leading to a new level of BERT training performance. Load balancing is imperative in distributed BERT training because its training datasets contain samples of widely varying lengths. Communication cost, which is proportional to the scale of distributed training, needs to be hidden by useful computation. In addition, optimizers such as ADAM and LAMB need to be carefully re-evaluated in the context of large-scale distributed training. We propose two new ideas: (1) local presorting based on dataset stratification for load balancing and (2) bucket-wise gradient clipping before allreduce, which lets us benefit both from the overlap of gradient computation and synchronization and from the fast training of gradient clipping before allreduce. We also re-evaluate existing optimizers via hyperparameter optimization and adopt ADAM, which further contributes to fast training through larger batches than existing methods. Our proposed methods, all combined, give the fastest MLPerf BERT training of 25.1 (22.3) seconds on 1,024 NVIDIA A100 GPUs, which is 1.33x (1.13x) and 1.57x faster than the other top two (one) submissions to MLPerf v1.1 (v2.0). Our implementation and evaluation results are available at MLPerf v1.1~v2.1.
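To make idea (2) concrete, the following is a minimal sketch of bucket-wise gradient clipping before allreduce, simulated with plain Python lists in place of real GPUs. It assumes that each worker clips every bucket against that bucket's own L2 norm using one shared threshold, and that allreduce averages the clipped buckets; the abstract does not spell out these details, and the names `clip_bucket`, `allreduce_mean`, and `bucketwise_clip_then_allreduce` are hypothetical.

```python
import math


def clip_bucket(grads, max_norm):
    """Scale one bucket of gradients so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grads]


def allreduce_mean(per_worker_bucket):
    """Simulated allreduce: element-wise mean of one bucket across workers."""
    n = len(per_worker_bucket)
    return [sum(vals) / n for vals in zip(*per_worker_bucket)]


def bucketwise_clip_then_allreduce(worker_grads, max_norm):
    """worker_grads[w][b] holds the gradients of bucket b on worker w.

    Each bucket is clipped locally and reduced on its own, so a bucket never
    waits for a norm computed over the full gradient (assumption: this is
    what allows bucket-level overlap with backpropagation).
    """
    num_buckets = len(worker_grads[0])
    reduced = []
    for b in range(num_buckets):
        clipped = [clip_bucket(worker[b], max_norm) for worker in worker_grads]
        reduced.append(allreduce_mean(clipped))
    return reduced


# Two simulated workers, two buckets each (illustrative values only).
worker_grads = [
    [[3.0, 4.0], [0.3, 0.4]],  # worker 0
    [[0.0, 0.0], [0.6, 0.8]],  # worker 1
]
result = bucketwise_clip_then_allreduce(worker_grads, max_norm=1.0)
print(result)
```

In this toy run only worker 0's first bucket exceeds the threshold, so it is the only one rescaled before averaging. A real implementation would issue each bucket's allreduce asynchronously while earlier layers are still computing gradients.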

Paper Structure (25 sections, 1 equation, 11 figures, 5 tables, 1 algorithm)

Introduction
Background
Data Parallelism
Large Batch Training
Gradient Clipping
BERT Training in MLPerf Benchmark
Motivation
Irregular Sequence Length in NLP Datasets
Gradient Clipping After/Before AllReduce
Proposed Method
Local Presorting with Dataset Stratification
Bucket-wise Gradient Clipping before AllReduce
Experiments
MLPerf BERT Benchmark Results
Irregular Sequence Length Handling
...and 10 more sections

Figures (11)

Figure 1: Illustrative comparison of load balancing. Each horizontal bar represents a sequence and its length is proportional to the number of tokens of the sequence. Numbers below bars represent per-GPU token counts, i.e., the total sum of sample token counts processed by the associated GPU. Larger difference between maximum and minimum token count implies higher imbalance.
Figure 2: An illustration of two possible ways to implement gradient clipping in distributed training: (a) gradient clipping after allreduce and (b) gradient clipping before allreduce.
Figure 3: Example of stratification when dataset is stratified into four strata and batch size is 16.
Figure 4: The example of local presorting with two different retrieving patterns: (a) raster and (b) snake scanning.
Figure 5: Timeline of gradient clipping methods: (a) gradient clipping after allreduce, (b) gradient clipping before allreduce, and (c) bucket-wise gradient clipping.
...and 6 more figures
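
The imbalance measure described in the Figure 1 caption (the gap between the largest and smallest per-GPU token count) can be sketched as follows. The example compares two ways of dealing length-sorted sequences to GPUs, with the names raster and snake borrowed from the Figure 4 caption. This is only an illustration under assumed sequence lengths; the helper names are hypothetical, and the sketch does not reproduce the paper's stratification or local presorting procedure.

```python
def per_gpu_tokens(assignment):
    """Total token count handled by each GPU."""
    return [sum(seqs) for seqs in assignment]


def imbalance(assignment):
    """Difference between the largest and smallest per-GPU token count."""
    counts = per_gpu_tokens(assignment)
    return max(counts) - min(counts)


def raster_assign(lengths, num_gpus):
    """Deal sequences to GPUs in the fixed order 0, 1, ..., n-1, 0, 1, ..."""
    out = [[] for _ in range(num_gpus)]
    for i, length in enumerate(lengths):
        out[i % num_gpus].append(length)
    return out


def snake_assign(lengths, num_gpus):
    """Deal sequences in alternating order 0..n-1, then n-1..0, and so on."""
    out = [[] for _ in range(num_gpus)]
    for i, length in enumerate(lengths):
        row, col = divmod(i, num_gpus)
        gpu = col if row % 2 == 0 else num_gpus - 1 - col
        out[gpu].append(length)
    return out


# Assumed sequence lengths in tokens, sorted in descending order.
lengths = [512, 400, 300, 200, 128, 64, 32, 16]

raster = raster_assign(lengths, num_gpus=4)
snake = snake_assign(lengths, num_gpus=4)

print(per_gpu_tokens(raster), imbalance(raster))  # [640, 464, 332, 216] 424
print(per_gpu_tokens(snake), imbalance(snake))    # [528, 432, 364, 328] 200
```

With these inputs the snake order pairs each long sequence with a short one, which narrows the gap between the most and least loaded GPU.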
