Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training

Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou

TL;DR

Lumos addresses the challenge of efficiently optimizing large-scale LLM training by introducing a trace-driven performance modeling toolkit. It builds a fine-grained execution graph from runtime traces (via Kineto), supports graph manipulation to reflect alternative parallelism and architectural configurations, and uses simulation to replay or predict performance without on-hardware experimentation. The approach achieves an average replay error of 3.3% across GPT-3 variants on up to 512 NVIDIA H100 GPUs, and demonstrates accurate estimation for new configurations by adjusting data, tensor, and pipeline parallelism, as well as model architectures. This enables faster, cost-effective exploration of optimization opportunities and provides detailed insights into execution breakdowns and SM utilization. The work offers practical value for developers and system researchers seeking to optimize distributed LLM training pipelines with fine-grained, what-if analyses.

Abstract

Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model's behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.
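To make the idea of trace-driven replay concrete, the following is a minimal sketch of simulating an execution graph. It is not Lumos's implementation or API: the function name `replay`, the task and stream names, and all durations are hypothetical values chosen for illustration. It assumes each task has a fixed duration taken from a trace, that a task starts only after all of its predecessors finish, and that tasks on the same stream run one at a time in the order they become ready.

```python
from collections import defaultdict, deque


def replay(tasks, deps):
    """Simulate a task graph and return each task's finish time.

    tasks: {name: (stream, duration)}  -- hypothetical trace-derived data
    deps:  [(pred, succ), ...]         -- succ may start only after pred ends
    """
    succs = defaultdict(list)
    indegree = {name: 0 for name in tasks}
    for pred, succ in deps:
        succs[pred].append(succ)
        indegree[succ] += 1

    ready = deque(name for name in tasks if indegree[name] == 0)
    earliest = defaultdict(float)     # earliest start allowed by dependencies
    stream_free = defaultdict(float)  # time at which each stream is next idle
    finish = {}

    while ready:
        name = ready.popleft()
        stream, duration = tasks[name]
        # A task waits for both its predecessors and its stream.
        start = max(earliest[name], stream_free[stream])
        finish[name] = start + duration
        stream_free[stream] = finish[name]
        for nxt in succs[name]:
            earliest[nxt] = max(earliest[nxt], finish[name])
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(finish) != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return finish


# Hypothetical example: two backward kernels on a compute stream, one
# all-reduce on a communication stream, and an optimizer step.
tasks = {
    "bwd1": ("compute", 10.0),
    "bwd2": ("compute", 10.0),
    "ar1": ("comm", 15.0),
    "opt": ("compute", 5.0),
}
deps = [("bwd1", "bwd2"), ("bwd1", "ar1"), ("bwd2", "opt"), ("ar1", "opt")]

finish = replay(tasks, deps)
makespan = max(finish.values())
serial_time = sum(duration for _, duration in tasks.values())
```

In this toy graph the all-reduce overlaps with the second backward kernel, so the simulated iteration time is shorter than the sum of the task durations. A what-if analysis in the same spirit would edit `tasks` or `deps` (for example, changing a duration or adding a dependency) and call `replay` again instead of rerunning on hardware.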

Paper Structure

This paper contains 24 sections, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Execution breakdown for one training iteration of GPT-3 175B, configured with tensor parallelism = 8, pipeline parallelism = 4, and data parallelism = 8.
  • Figure 2: Overview of Lumos's workflow.
  • Figure 3: Four types of dependencies between the tasks.
  • Figure 4: Updated pipeline schedule for rank_0 with 2x PP, assuming the number of micro-batches is equal to TP × PP and the 1F1B scheduling policy (Narayanan et al., 2021).
  • Figure 5: Per-iteration training time with its breakdown across various model sizes and parallelism strategies: comparison of actual execution, dPRO, and Lumos.
  • ...and 3 more figures