vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training

Jehyeon Bang; Yujeong Choi; Myeongwoo Kim; Yongdeok Kim; Minsoo Rhu

TL;DR

This paper presents a profiling-driven simulator called vTrain, providing AI practitioners a fast yet accurate software framework to determine an efficient and cost-effective LLM training system configuration, and demonstrates its practicality through several case studies.

Abstract

As large language models (LLMs) become widespread across application domains, a critical challenge facing the AI community is how to train these large models cost-effectively. Existing LLM training plans typically employ a heuristic-based parallel training strategy that rests on empirical observations rather than on a thorough examination of the LLM parallelization search space. This limitation causes existing systems to leave significant performance on the table, wasting millions of dollars in training cost. This paper presents vTrain, a profiling-driven simulator that gives AI practitioners a fast yet accurate software framework for determining an efficient and cost-effective LLM training system configuration. We demonstrate vTrain's practicality through several case studies, such as evaluating optimal training parallelization strategies that balance training time against its associated training cost, designing efficient multi-tenant GPU cluster schedulers targeting multiple LLM training jobs, and determining a compute-optimal LLM architecture given a fixed compute budget.

Paper Structure (18 sections, 2 equations, 14 figures, 5 tables, 1 algorithm)

Introduction
Background
Transformer-based Large Language Models (LLMs)
LLM Parallelization Strategies
vTrain: A Profiling-based Software Framework for Simulating LLM Training Time
Simulation Framework Overview
(High-level) Operator-granularity Execution Graph
Profiling Module
(Low-level) Task-granularity Execution Graph
Simulation Algorithm for Estimating Training Time
Profiling Cost and Simulation Speed
Simulation and Validation Methodology
Design Space Exploration for Cost-Effective and Compute-Optimal LLM Training using vTrain
Case Study #1: Cost-effective LLM Training Plan
Case Study #2: Cost-effective Multi-tenant GPU Clusters
...and 3 more sections

Figures (14)

Figure 1: Wall-clock training time of GPT-3 (175B parameters) as a function of GPU compute utilization, assuming 1,024 NVIDIA A100 GPUs are used for training. GPU compute utilization is the achieved FLOPS relative to the maximum FLOPS. Training time is primarily determined by dividing the total number of FLOPs needed to train an LLM by the aggregate effective FLOPS available across the 1,024 A100 GPUs. We estimate training time by varying the effective FLOPS of the A100 GPUs and derive the training cost from AWS EC2 P4d GPU instance pricing [amazon_p4d].
Figure 2: A transformer-based, decoder-only LLM architecture.
Figure 3: An LLM training system employing 3D parallelism. The example combines 4-way tensor parallelism (the 4 GPUs within each node invoking the yellow All-Reduce), 2-way data parallelism (the node pairs invoking the gray inter-node All-Reduce), and 3-way pipeline parallelism (the node groups [0,1,2] and [3,4,5] invoking the orange inter-node Send-Receive). In the rest of this paper, (t, d, p)-way 3D parallelism refers to a training system configuration employing t-way tensor, d-way data, and p-way pipeline parallelism; the example illustrates (4,2,3)-way 3D parallelism.
Figure 4: Key components of vTrain and its simulation flow.
Figure 5: Inserting All-Reduce operators for data-parallel training when gradient bucketing is (a) enabled and (b) disabled. "Bwd i" represents the $i^{th}$ layer's backward pass and "WU" refers to the weight update pass. The example in (a) assumes that layers (1 & 2) and (3 & 4) are each grouped into a bucket.
...and 9 more figures
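
The Figure 1 caption describes a first-order estimate: training time is the total training FLOPs divided by the aggregate effective FLOPS. The Figure 3 caption implies that a (t, d, p)-way configuration uses t × d × p GPUs. The sketch below illustrates both calculations and is not vTrain's simulator. The function names are hypothetical. The 6 × parameters × tokens FLOP approximation, the 300B-token count, and the 312 TFLOPS A100 peak are assumptions that do not come from the text above, and the hourly price is left as a caller-supplied parameter.

```python
SECONDS_PER_DAY = 86_400


def gpus_for_3d_parallelism(t: int, d: int, p: int) -> int:
    """Number of GPUs used by a (t, d, p)-way 3D-parallel configuration."""
    if min(t, d, p) < 1:
        raise ValueError("t, d and p must all be positive")
    return t * d * p


def training_days(total_flops: float, num_gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """First-order training time: total FLOPs / aggregate effective FLOPS."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    effective_flops = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops / SECONDS_PER_DAY


def training_cost(days: float, num_gpus: int, price_per_gpu_hour: float) -> float:
    """Cost = GPU-hours consumed times a caller-supplied hourly price."""
    return days * 24 * num_gpus * price_per_gpu_hour


# Assumed inputs (not taken from the paper page above):
PARAMS = 175e9              # GPT-3 parameter count (stated in the caption)
TOKENS = 300e9              # assumption: number of training tokens
TOTAL_FLOPS = 6 * PARAMS * TOKENS   # assumption: common 6*N*D approximation
A100_PEAK_FLOPS = 312e12    # assumption: A100 dense FP16/BF16 tensor-core peak

if __name__ == "__main__":
    print("GPUs in (4,2,3)-way 3D parallelism:", gpus_for_3d_parallelism(4, 2, 3))
    for util in (0.3, 0.4, 0.5):
        days = training_days(TOTAL_FLOPS, 1024, A100_PEAK_FLOPS, util)
        print(f"utilization {util:.0%}: {days:.1f} days")
```

Because the estimate is inversely proportional to utilization, halving the achieved utilization doubles both the training time and, for a fixed hourly price, the cost. This is the sensitivity that the Figure 1 caption describes.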
