Characterization of Large Language Model Development in the Datacenter

Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang

TL;DR

The paper analyzes LLM development workloads in a private GPU datacenter to reveal how these workloads differ from prior DL traces, highlighting short job durations, GPU-dominated resource usage, and prevalent infrastructure failures. It provides a six-month trace from two homogeneous clusters (Seren and Kalos) and examines data, compute, and environmental aspects, showing skewed workload distributions and high queuing delays for evaluation due to pretraining resource reservation. To address these challenges, the authors implement two deployed systems: fault-tolerant pretraining featuring asynchronous checkpointing and LLM-assisted failure diagnosis, and decoupled scheduling for evaluation to accelerate timely feedback. The work offers concrete, system-level strategies for optimizing LLM development in datacenters and shares traces and code to support broader research on robust, efficient large-scale ML infrastructure.

Abstract

Large Language Models (LLMs) have demonstrated impressive performance across several transformative tasks. However, efficiently utilizing large-scale cluster resources to develop LLMs is non-trivial and is often riddled with challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures. Our analysis summarizes the hurdles we encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce our system efforts: (1) fault-tolerant pretraining, which enhances fault tolerance through LLM-involved failure diagnosis and automatic recovery; and (2) decoupled scheduling for evaluation, which achieves timely performance feedback via trial decomposition and scheduling optimization.

Paper Structure

This paper contains 32 sections, 22 figures, 3 tables.

Figures (22)

  • Figure 1: Overview of the LLM development pipeline.
  • Figure 2: Overview of different datacenter characteristics. (a) Workload: CDF of the GPU job duration. (b) Infrastructure: CDF of GPU utilization, where Helios' data is not available.
  • Figure 3: Comparison of workload distribution based on the number of requested GPUs. (a) CDF of job count. (b) CDF of GPU time (i.e., requested GPU number $\times$ duration).
  • Figure 4: Distribution of different workload types in Seren (a, b) and Kalos (c, d). Note that CPU jobs are excluded. SFT: Supervised Fine-Tuning for model alignment. MLLM: Multimodal Large Language Model. Other: Unclassified jobs.
  • Figure 5: The boxplot of the distribution of GPU demand across different workload types in Seren (a) and Kalos (b).
  • ...and 17 more figures