Forecasting GPU Performance for Deep Learning Training and Inference

Seonho Lee; Amar Phanishayee; Divya Mahajan

TL;DR

NeuSight tackles the challenge of forecasting deep learning performance on unseen GPUs by decomposing kernel execution into tiles that are executed across SMs. A set of per-tile ML predictors estimates device utilization, which, when bounded by fundamental performance laws (e.g., roofline), yields robust per-kernel latency estimates that aggregate to end-to-end and distributed latency predictions. The framework supports operator fusion and multiple parallelism strategies, enabling accurate predictions across training, inference, and multi-GPU setups, including cross-vendor GPUs. Empirical results show that NeuSight substantially reduces prediction error relative to prior methods, including on out-of-distribution GPUs and models, and the authors provide an open-source implementation.

Abstract

Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the streaming multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends or multilayer perceptrons to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher error percentages when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems, bounding the prediction through fundamental performance laws. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. Compared to state-of-the-art prior work, it reduces the percentage error from 121.4% and 30.8% to 2.3% in predicting the latency of the GPT3 model for training and inference on H100, where neither GPT3 nor H100 was used to train the framework.
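The tile-based, roofline-bounded idea described in the abstract can be illustrated with a small sketch. This is not the NeuSight implementation: the function names, the per-SM peak and bandwidth parameters, the byte-count formula, and the fixed `utilization` value are all assumptions made for illustration. In NeuSight the utilization would come from a learned predictor; here it is a plain number passed in by the caller.

```python
import math


def num_tiles(m, n, tile_m, tile_n, batch=1):
    """Number of output tiles for a batched (m x k) @ (k x n) matmul."""
    return batch * math.ceil(m / tile_m) * math.ceil(n / tile_n)


def num_waves(tiles, num_sms, tiles_per_sm=1):
    """Waves needed when each SM runs `tiles_per_sm` tiles concurrently."""
    return math.ceil(tiles / (num_sms * tiles_per_sm))


def roofline_tile_latency(flops, bytes_moved, peak_flops, mem_bw, utilization):
    """Roofline lower bound for one tile, scaled by utilization in (0, 1]."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    ideal = max(flops / peak_flops, bytes_moved / mem_bw)
    return ideal / utilization


def estimate_matmul_latency(m, n, k, tile_m, tile_n, num_sms,
                            peak_flops_per_sm, mem_bw_per_sm,
                            utilization, batch=1, dtype_bytes=2):
    """Hypothetical kernel estimate: waves x per-tile latency."""
    tiles = num_tiles(m, n, tile_m, tile_n, batch)
    waves = num_waves(tiles, num_sms)
    # Assumed cost model for one output tile: one multiply-add per
    # (row, column, k) triple, and each operand tile read or written once.
    flops = 2 * tile_m * tile_n * k
    bytes_moved = dtype_bytes * (tile_m * k + k * tile_n + tile_m * tile_n)
    tile_latency = roofline_tile_latency(
        flops, bytes_moved, peak_flops_per_sm, mem_bw_per_sm, utilization
    )
    return waves * tile_latency
```

For the 4x4 by 4x4 multiplication with 2x2 tiles used in Figure 3, `num_tiles(4, 4, 2, 2)` gives 4 output tiles; on a hypothetical device with 2 SMs these would run in 2 waves. Because utilization never exceeds 1, the estimate cannot fall below the roofline bound, which is the sense in which the prediction is bounded by a performance law.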

Paper Structure (34 sections, 8 equations, 10 figures, 9 tables)

This paper contains 34 sections, 8 equations, 10 figures, 9 tables.

Introduction
Background
GPU Architecture
Deep Learning Execution
Motivation
Predicting Performance of Batched Matrix Multiplication Using Prior Work
Predicting Performance of Batched Matrix Multiplication with Larger Predictors
Other Related Works
NeuSight Forecasting
Kernel Execution on GPUs
Kernel-wise Prediction
Machine Learning Model to Predict Utilization
Support for Operator Fusion
NeuSight Workflow
Forecasting for Distributed Execution
...and 19 more sections

Figures (10)

Figure 1: Growth of AI models and the compute and memory capacity of GPUs.
Figure 2: Prediction error of prior work on the BMM operator, reported as percentage error. Out-of-distribution dimensions and GPUs are highlighted. For these results, we trained Habitat and Li et al. models only up to V100, excluding any A100s, H100, and L4.
Figure 3: Dataflow of a GEMM on a GPU. We assume multiplication between two 4x4 matrices and tile size of 2x2.
Figure 4: Tiles are distributed evenly across SMs and executed concurrently, in multiple waves.
Figure 5: Performance of (256 × 256) × (256 × 256) matrix multiplication with varying waves on V100; sweeping the number of waves by changing the batch size from 1 to 300.
...and 5 more figures
