Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou
TL;DR
Lumos is a trace-driven performance modeling toolkit for large-scale LLM training. It builds a fine-grained execution graph from runtime traces collected with Kineto, lets users manipulate that graph to reflect alternative parallelism strategies and model architectures, and then simulates the graph to replay the original run or predict the performance of a new configuration without rerunning it on hardware. Across GPT-3 variants on up to 512 NVIDIA H100 GPUs, Lumos replays execution time with an average error of 3.3%, and it accurately estimates performance for new configurations that change data, tensor, or pipeline parallelism or the model architecture. This makes exploring optimization opportunities faster and cheaper, and it provides detailed insight into execution breakdowns and SM utilization. The toolkit is aimed at developers and system researchers who want fine-grained what-if analysis of distributed LLM training.
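The replay-and-predict idea in the TL;DR can be illustrated with a toy simulator. This is a minimal sketch and not Lumos's implementation or API: the `Task` record, the `simulate` and `scale_stream` helpers, the stream names, and all durations are hypothetical. It assumes tasks arrive in trace order (every dependency listed before its dependents) and that each stream runs its tasks one at a time in that order.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Task:
    """One node of a toy execution graph (hypothetical schema, not Lumos's)."""
    name: str
    stream: str          # e.g. "compute" or "comm"; tasks on one stream serialize
    duration_us: float   # measured duration taken from a trace
    deps: tuple = ()     # names of tasks that must finish first


def simulate(tasks):
    """Replay tasks in trace order and return the end-to-end time in microseconds."""
    finish = {}       # task name -> simulated finish time
    stream_free = {}  # stream name -> time at which the stream becomes idle
    for t in tasks:
        missing = [d for d in t.deps if d not in finish]
        if missing:
            raise ValueError(f"{t.name} listed before its dependencies: {missing}")
        # A task starts once its dependencies are done and its stream is idle.
        ready = max((finish[d] for d in t.deps), default=0.0)
        start = max(ready, stream_free.get(t.stream, 0.0))
        finish[t.name] = start + t.duration_us
        stream_free[t.stream] = finish[t.name]
    return max(finish.values(), default=0.0)


def scale_stream(tasks, stream, factor):
    """What-if edit: rescale the duration of every task on one stream."""
    return [
        replace(t, duration_us=t.duration_us * factor) if t.stream == stream else t
        for t in tasks
    ]


# Made-up trace: backward compute overlapped with gradient all-reduce.
TRACE = [
    Task("fwd", "compute", 100.0),
    Task("bwd_layer2", "compute", 100.0, ("fwd",)),
    Task("allreduce_layer2", "comm", 150.0, ("bwd_layer2",)),
    Task("bwd_layer1", "compute", 100.0, ("bwd_layer2",)),
    Task("allreduce_layer1", "comm", 150.0, ("bwd_layer1",)),
    Task("optimizer", "compute", 50.0, ("allreduce_layer2", "allreduce_layer1")),
]

baseline_us = simulate(TRACE)
faster_comm_us = simulate(scale_stream(TRACE, "comm", 0.5))
print(baseline_us)     # 550.0
print(faster_comm_us)  # 425.0
```

Halving the communication time in this toy trace shortens the iteration by 125 µs rather than the full 150 µs of communication removed, because part of the all-reduce was already hidden behind backward compute. Capturing that kind of overlap is the reason to simulate a dependency graph instead of summing kernel durations.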
Abstract
Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model's behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.
