Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Joyjit Kundu; Wenzhe Guo; Ali BanaGozar; Udari De Alwis; Sourav Sengupta; Puneet Gupta; Arindam Mallik

TL;DR

The paper presents Optimus, an end-to-end analytical framework for performance modeling of distributed large language model training and inference that integrates compute, memory sub-systems, and network with various parallelization strategies. It extends prior work by incorporating activation recomputation, KV-cache dynamics, and a Megatron-inspired mapping within a roofline-based, architecture-aware model, validated against published GEMM/GEMV data and real LLM workloads. Core contributions include a design space exploration framework, comprehensive validation across GPUs and architectures, and case studies showing how scaling in logic, memory, and memory technology shifts bottlenecks from compute to memory and then to interconnect bandwidth. The framework provides actionable insights for hardware-software co-design, guiding decisions on parallelization, memory management, and network infrastructure to optimize training and inference efficiency and total cost of operation.

Abstract

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of distributed LLM training and inference through an analytical framework that accurately considers compute, memory sub-system, network, and various parallelization strategies (model parallel, data parallel, pipeline parallel, and sequence parallel). We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA). For distributed training, we investigate the memory footprint of LLMs for different activation recomputation methods, dissect the key factors behind the massive performance gain from A100 to B200 ($\sim$35x speed-up, closely following NVIDIA's scaling trend), and further run a design space exploration at different technology nodes (12 nm to 1 nm) to study the impact of logic, memory, and network scaling on performance. For inference, we analyze the compute versus memory boundedness of different operations at a matrix-multiply level for different GPU systems and further explore the impact of DRAM memory technology scaling on inference latency. Utilizing our modeling framework, we reveal the evolution of performance bottlenecks for both LLM training and inference with technology scaling, thereby providing insights to design future systems for LLM training and inference.
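The compute-versus-memory boundedness analysis mentioned in the abstract can be illustrated with a basic roofline estimate for a single matrix multiply. The sketch below is not the paper's framework: the `Device` class, the `gemm_roofline` function, and the device figures (roughly A100-like peak FP16 throughput and HBM bandwidth) are illustrative assumptions, and it ignores caches, tiling, and kernel launch overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description; the numbers used below are assumptions."""
    peak_flops: float  # peak throughput in FLOP/s
    mem_bw: float      # main-memory bandwidth in bytes/s


def gemm_roofline(m, n, k, dev, bytes_per_elem=2):
    """Estimate time for C[m,n] = A[m,k] @ B[k,n] with a simple roofline.

    Assumes each operand is read from or written to main memory exactly once.
    Returns (estimated_seconds, "compute" or "memory").
    """
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    t_compute = flops / dev.peak_flops
    t_memory = bytes_moved / dev.mem_bw
    bound = "compute" if t_compute >= t_memory else "memory"
    return max(t_compute, t_memory), bound


# Assumed, roughly A100-like figures: 312 TFLOP/s FP16 and 2 TB/s HBM.
a100_like = Device(peak_flops=312e12, mem_bw=2.0e12)

# A large square GEMM (typical of training or prefill).
gemm_time, gemm_bound = gemm_roofline(4096, 4096, 4096, a100_like)

# A GEMV, i.e. m = 1 (typical of single-token decoding).
gemv_time, gemv_bound = gemm_roofline(1, 4096, 4096, a100_like)

print(f"GEMM: {gemm_time * 1e6:.1f} us, {gemm_bound}-bound")
print(f"GEMV: {gemv_time * 1e6:.1f} us, {gemv_bound}-bound")
```

Under these assumptions the square GEMM lands on the compute roof, while the GEMV is limited by memory bandwidth because every weight is read once and used for only one multiply-add. That is the usual reason memory technology matters more for inference latency than for training throughput.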

Paper Structure (24 sections, 4 equations, 9 figures, 4 tables)

Introduction & Background
Transformers
Performance Bottlenecks
Parallelization
Related Work, Trends and Gap areas
Methodology
Framework overview
Mapping & Parallelization strategy
Activation recomputation
Modeling All-to-All communication
KV-cache modeling
Design space exploration framework
Validation
Distributed GEMM and GEMV validation
LLM training validation for GPUs
...and 9 more sections

Figures (9)

Figure 1: Overview of our performance modeling framework: $\mu$Arch engine generates a microarchitecture from the inputs. The architecture abstraction layer constructs a high-level representation of the underlying architecture. Given an LLM workload, the framework builds a task graph and parallelizes across multiple devices based on mapping. The performance prediction engine predicts the execution time.
Figure 2: The model parallelism strategy proposed in the Megatron-LM paper (Shoeybi et al., 2019) effectively reduces the need for synchronization and communication.
Figure 3: Correlation between GPU runtime and our prediction for GEMV validation on a single A100 GPU.
Figure 4: Memory breakdown for training GPT models. The dashed line indicates the NVIDIA A100 memory capacity, 80 GB. For each GPT model, three activation recomputation methods are compared: no recomputation, selective recomputation, and full recomputation.
Figure 5: Training performance scaling across multiple GPU generations for GPT-3 175B. Training times are normalized against that of B200-NVS-L. The A100 cluster is connected through an HDR InfiniBand network, while the others are configured with an NDR InfiniBand network or an NVLink switch system (NVS). "L" indicates a larger batch size. "Other" includes weight update time and pipeline bubble time.
...and 4 more figures
