Boosting Distributed Training Performance of the Unpadded BERT Model

Jinle Zeng; Min Li; Zhihua Wu; Jiaqi Liu; Yuang Liu; Dianhai Yu; Yanjun Ma

TL;DR

The paper tackles the padding-induced inefficiencies in distributed BERT training by introducing an unpadded BERT model that supports variable-length inputs. It combines grouped multi-stream FMHA, CPU-assisted padding exchange, and extensive kernel/operator optimizations (kernel fusion, LAMB, embedding) to achieve large throughput gains and better load balancing. Ablation experiments quantify the contribution of each component, and MLPerf v2.0 results show competitive end-to-end times and leading throughput on 8× A100 GPUs. The approach delivers state-of-the-art performance and is extensible to other Transformer-based models, with practical implications for large-scale NLP pre-training.
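
To make the padding overhead concrete, here is a minimal sketch of padding removal in plain Python. It is illustrative only and is not the authors' implementation: the function names `unpad` and `padded_fraction`, the right-padded layout, and the 0/1 attention mask are all assumptions. The idea is to pack the real tokens of a batch into one flat sequence and keep cumulative offsets, so that later computation can touch only real tokens.

```python
from itertools import accumulate


def unpad(input_ids, attention_mask):
    """Pack a padded batch into one flat token list plus cumulative offsets.

    Hypothetical helper: `attention_mask` holds 1 for real tokens, 0 for padding.
    """
    # Number of real tokens in each sequence.
    seq_lens = [sum(mask_row) for mask_row in attention_mask]
    # Sequence i occupies packed[cu_seqlens[i]:cu_seqlens[i + 1]].
    cu_seqlens = [0] + list(accumulate(seq_lens))
    # Keep only the tokens whose mask entry is 1, in batch order.
    packed = [
        tok
        for ids_row, mask_row in zip(input_ids, attention_mask)
        for tok, keep in zip(ids_row, mask_row)
        if keep
    ]
    return packed, cu_seqlens


def padded_fraction(attention_mask):
    """Fraction of batch positions that hold padding rather than real tokens."""
    total = sum(len(row) for row in attention_mask)
    real = sum(sum(row) for row in attention_mask)
    return 1 - real / total
```

For a batch of three sequences with 4, 3, and 6 real tokens padded to length 6, `unpad` returns 13 packed tokens with offsets `[0, 4, 7, 13]`, and 5 of the 18 positions were padding.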

Abstract

Pre-trained models are an important tool in Natural Language Processing (NLP), and BERT is a classic pre-trained model whose structure has been widely adopted by later work. It was even chosen as the reference model for the MLPerf training benchmark. Optimizing the distributed training performance of BERT models therefore plays an important role in accelerating solutions to most NLP tasks. The BERT model often takes padded tensors as its inputs, which leads to excessive redundant computation. Removing this redundant computation is thus essential to improving distributed training performance. This paper designs a new approach to train BERT models with variable-length inputs efficiently. First, we propose a general structure for variable-length BERT models and accelerate the encoder layer with our grouped multi-stream FMHA (Fused Multi-Head Attention) method. Second, we address the unbalanced workload caused by variable-length inputs through data exchange, which is largely overlapped with the training process. Finally, we optimize the overall performance of the BERT model with techniques such as kernel fusion and operator optimization. Our experimental results show that our highly optimized BERT model achieves state-of-the-art throughput and ranks first in MLPerf Training v2.0 within the same GPU configuration. The optimizations in this paper can be applied to more BERT-like models in our future work.
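
The workload imbalance mentioned in the abstract can be illustrated with a small sketch. Once padding is removed, a worker's cost tracks its number of real tokens, so workers holding long sequences lag behind the others. The code below is an assumption-laden illustration, not the paper's exchange algorithm: the function name `balance` and the serpentine dealing strategy are hypothetical choices made here for clarity. It pools the sequence lengths from all workers, sorts them, and deals them back so that every worker gets a similar token count.

```python
def balance(seq_lens_per_worker):
    """Redistribute sequence lengths so workers hold similar token totals.

    Hypothetical strategy: sort all lengths in descending order, then deal
    them to workers in serpentine (back-and-forth) order.
    """
    num_workers = len(seq_lens_per_worker)
    all_lens = sorted(
        (length for worker in seq_lens_per_worker for length in worker),
        reverse=True,
    )
    balanced = [[] for _ in range(num_workers)]
    for i, length in enumerate(all_lens):
        round_idx, pos = divmod(i, num_workers)
        # Reverse the dealing direction on every other round.
        worker = pos if round_idx % 2 == 0 else num_workers - 1 - pos
        balanced[worker].append(length)
    return balanced
```

With two workers holding lengths `[512, 480, 500, 450]` and `[30, 64, 20, 50]`, the token totals are 1942 and 164 before the exchange and 1046 and 1060 after it, while each worker still holds four sequences. A real system would exchange the sequences themselves rather than only their lengths.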

Paper Structure (27 sections, 1 equation, 15 figures, 4 tables)

Introduction
Related Works
Background
Overview of the BERT Model
Training Strategy of the BERT Model
Challenges of the Variable-Length BERT Optimization
Computational Optimizations for Variable-Length Inputs
Load Balance in Distributed Training
Implementation and Optimization
Supporting and Optimizing for Variable-length Workloads
Supporting Unpad Computing
Optimizing Unpad Attention Computing
Load Balance Optimization in Distributed Training
Padding exchange across the workers
Overlapping of the Padding Removal Process and the GPU Training
...and 12 more sections

Figures (15)

Figure 1: Structure of the BERT encoder.
Figure 2: The Training Process of the BERT Model.
Figure 3: The Data Parallel Training of BERT Model.
Figure 4: The distribution of the sequence length for Wikipedia data set.
Figure 5: The unbalanced loads of the unpadded BERT model.
...and 10 more figures
