
vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training

Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, Minsoo Rhu

TL;DR

This paper presents a profiling-driven simulator called vTrain, providing AI practitioners a fast yet accurate software framework to determine an efficient and cost-effective LLM training system configuration, and demonstrates its practicality through several case studies.

Abstract

As large language models (LLMs) become widespread across application domains, a critical challenge facing the AI community is how to train these large AI models cost-effectively. Existing LLM training plans typically employ a heuristic-based parallel training strategy that rests on empirical observations rather than on a thorough examination of the search space of LLM parallelization. This limitation causes existing systems to leave significant performance on the table, wasting millions of dollars' worth of training cost. This paper presents our profiling-driven simulator called vTrain, which provides AI practitioners a fast yet accurate software framework for determining an efficient and cost-effective LLM training system configuration. We demonstrate vTrain's practicality through several case studies, e.g., evaluating optimal training parallelization strategies that balance training time and its associated training cost, evaluating efficient multi-tenant GPU cluster schedulers targeting multiple LLM training jobs, and determining a compute-optimal LLM model architecture given a fixed compute budget.

Paper Structure

This paper contains 18 sections, 2 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: Wall-clock training time of GPT-3 (175B parameters) as a function of GPU compute utilization, assuming 1,024 NVIDIA A100 GPUs are used for training. GPU compute utilization refers to the achieved FLOPS relative to the maximum FLOPS. Training time is primarily determined by dividing "the total number of FLOPs to train an LLM" by "the aggregate, effective FLOPS available for training across the 1,024 A100 GPUs". We estimate training time by varying the effective FLOPS of the A100 GPUs and derive the training cost from AWS EC2 P4d GPU instance pricing information [amazon_p4d].
  • Figure 2: A transformer-based, decoder-only LLM architecture.
  • Figure 3: An LLM training system employing 3D parallelism. The example combines 4-way tensor parallelism (the 4 GPUs within each node invoking the yellow intra-node All-Reduce), 2-way data parallelism (the node pairs invoking the gray inter-node All-Reduce), and 3-way pipeline parallelism (the node groups [0,1,2] and [3,4,5] invoking the orange inter-node Send-Receive). In the rest of this paper, a (t, d, p)-way 3D parallelism refers to a training system configuration employing t-way tensor, d-way data, and p-way pipeline parallelism, i.e., the example illustrates (4,2,3)-way 3D parallelism.
  • Figure 4: Key components of vTrain and its simulation flow.
  • Figure 5: Inserting All-Reduce operators for data-parallel training when gradient bucketing is (a) enabled and (b) disabled. "Bwd i" represents the $i$-th layer's backward pass and "WU" refers to the weight update pass. The example in (a) assumes that layers 1 and 2 are grouped into one bucket and layers 3 and 4 into another.
  • ...and 9 more figures
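
The Figure 1 caption describes training time as total training FLOPs divided by the aggregate effective FLOPS of the cluster. A minimal sketch of that arithmetic follows. It is not the paper's code, and several inputs are assumptions not stated in the caption: the common 6 × parameters × tokens approximation for total FLOPs, a 300B-token training set, a 312 TFLOPS dense half-precision peak per A100, 8 GPUs per P4d instance, and a placeholder hourly price that should be replaced with current AWS pricing.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops, num_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days = total FLOPs / aggregate effective FLOPS."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    # Effective FLOPS is the peak rate scaled by GPU compute utilization.
    effective_flops = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops / SECONDS_PER_DAY


def training_cost(days, num_gpus, gpus_per_instance, usd_per_instance_hour):
    """Rental cost of the instances needed to host num_gpus for `days`."""
    instances = num_gpus / gpus_per_instance
    return days * 24 * instances * usd_per_instance_hour


# All values below except the GPU count and model size are assumptions.
PARAMS = 175e9                      # GPT-3 size, from the caption
TOKENS = 300e9                      # assumed training-set size
TOTAL_FLOPS = 6 * PARAMS * TOKENS   # assumed 6*N*D approximation
A100_PEAK_FLOPS = 312e12            # assumed dense half-precision peak
NUM_GPUS = 1024                     # from the caption
GPUS_PER_INSTANCE = 8               # assumed P4d instance size
USD_PER_INSTANCE_HOUR = 32.77       # placeholder price, not from the source

for util in (0.3, 0.4, 0.5):
    days = training_days(TOTAL_FLOPS, NUM_GPUS, A100_PEAK_FLOPS, util)
    cost = training_cost(days, NUM_GPUS, GPUS_PER_INSTANCE, USD_PER_INSTANCE_HOUR)
    print(f"utilization={util:.0%}  days={days:.1f}  cost=${cost:,.0f}")
```

Because utilization appears in the denominator, training time and cost both scale inversely with it: under this model, halving utilization doubles both.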
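
The (t, d, p) notation from the Figure 3 caption can be made concrete by enumerating which GPUs communicate with each other. The sketch below is illustrative only: the function name `build_groups` and the rank layout (tensor-parallel ranks contiguous within a node, pipeline stages on consecutive nodes, data-parallel replicas in separate node groups) are assumptions chosen to match the caption's (4,2,3) example, not the paper's implementation.

```python
from itertools import product


def build_groups(t, d, p):
    """Return (tensor, pipeline, data) communication groups of global GPU ranks.

    Assumed layout: one node holds one t-way tensor-parallel group, and
    node index = data_idx * p + pipeline_idx.
    """
    def rank(ti, di, pi):
        node = di * p + pi
        return node * t + ti

    # GPUs in the same node: intra-node All-Reduce.
    tensor = [[rank(ti, di, pi) for ti in range(t)]
              for di, pi in product(range(d), range(p))]
    # GPUs forming one pipeline: inter-node Send-Receive.
    pipeline = [[rank(ti, di, pi) for pi in range(p)]
                for di, ti in product(range(d), range(t))]
    # GPUs holding replicas of the same shard: inter-node All-Reduce.
    data = [[rank(ti, di, pi) for di in range(d)]
            for pi, ti in product(range(p), range(t))]
    return tensor, pipeline, data


t, d, p = 4, 2, 3
tensor, pipeline, data = build_groups(t, d, p)
print("total GPUs:", t * d * p)
print("first tensor group (ranks):", tensor[0])
print("pipeline node groups:", sorted({tuple(r // t for r in g) for g in pipeline}))
print("data-parallel node pairs:", sorted({tuple(r // t for r in g) for g in data}))
```

Under this layout every GPU belongs to exactly one group of each kind, and the total GPU count is the product t × d × p.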