Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Zhongyi Lin; Ning Sun; Pallab Bhattacharya; Xizhou Feng; Louis Feng; John D. Owens

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

TL;DR

This work addresses challenges and enables multi-GPU performance modeling by incorporating data-distribution-aware performance models for embedding table lookup and data movement prediction of communication collectives into an upgraded performance modeling pipeline equipped with inter-and intra-rank synchronization for ML workloads trained on multi-GPU platforms.

Abstract

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter-and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of DLRM models with random configurations with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based NLP models with a geomean error of 3.00%. Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

TL;DR

Abstract

Paper Structure (21 sections, 4 equations, 11 figures, 4 tables, 1 algorithm)

This paper contains 21 sections, 4 equations, 11 figures, 4 tables, 1 algorithm.

Introduction
Related Works
Performance Modeling for Both Single-Device and Distributed Training
Model Parallelism and Sharding
Methodology
Dataset Exploration
Multi-GPU Training Benchmark and Analysis
Kernel and End-to-End Performance Modeling for ML Training on Multi-GPU
Communication Collective Performance Modeling
Multi-GPU E2E Performance Modeling
Embedding Lookup Kernel Modeling
Additional Ops Performance Modeling Support
Evaluation and Analysis
Kernel Performance Modeling
Multi-GPU E2E Performance Modeling
...and 6 more sections

Figures (11)

Figure 1: An overview of our prediction pipeline, based on the one proposed by Lin et al. Lin:2022:BAP. We mark new components with small darker shapes and italic text, such as microbenchmark and kernel-level performance models of FBGEMM embedding lookup (Section \ref{['sec:el']}), all-to-all (A2A), and all-reduce (AR) (Section \ref{['sec:comm']}), as well as inter- and intra-rank synchronization mechanism in the critical-path algorithm (Section \ref{['sec:multi_e2e']}) to handle multi-GPU end-to-end performance prediction.
Figure 2: Per-GPU-stream training execution time breakdown of selected ML workloads on a 4-GPU platform. The per-iteration time of each workload is provided for reference. We discuss these results in Section \ref{['sec:bench']}.
Figure 3: Histogram of average $L$ values of EL tables in the dataset. All bins but the last one are right-open.
Figure 4: Typical characteristic curves for data movement. X-axes are in log scale. $m_1$ and $m_2$ are boundary message sizes that separate the three regions.
Figure 5: Inter-rank synchronization (red dashed line) and intra-rank/inter-stream synchronization (blue dotted line). For simplicity, we assume two GPUs and two streams ($S_{cp}$ and $S_{cm}$, for compute and communication respectively) per GPU, while CPU op calls are omitted in the plot. Rectangles represent GPU kernels, and arrows indicate the data dependency between compute and communication kernels.
...and 6 more figures

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

TL;DR

Abstract

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Authors

TL;DR

Abstract

Table of Contents

Figures (11)