BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training

Rui Li, Xiaoyun Zhi, Jinxin Chi, Menghan Yu, Lixin Huang, Jia Zhu, Weilun Zhang, Xing Ma, Wenjia Liu, Zhicheng Zhu, Daowen Luo, Zuquan Song, Xin Yin, Chao Xiang, Shuguang Wang, Wencong Xiao, Gene Cooperman

TL;DR

BootSeer characterizes initialization bottlenecks in large-scale LLM training using real production data. In one of the authors' training clusters, startup overhead alone wastes more than 3.5% of GPU time, and it slows iterative update-debug cycles. The paper identifies three bottlenecks (container image loading, runtime dependency installation, and model checkpoint resumption) and proposes a matching technique for each: hot block record-and-prefetch for images, dependency snapshotting for the runtime environment, and striped HDFS-FUSE for checkpoints. The framework also includes a profiling system and mitigates stragglers, whose effect grows with job scale. Deployed in production and evaluated on real LLM training workloads, BootSeer reduces startup overhead by 50%.

Abstract

Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLM training, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of BootSeer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, BootSeer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. BootSeer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.

Paper Structure

This paper contains 22 sections, 13 figures.

Figures (13)

  • Figure 1: Startup process of an LLM training job. Steps marked with (Sync) indicate that all worker nodes must synchronize at that stage.
  • Figure 2: Startup impact of LLM training on job-level (left) and node-level (right) startup overheads. Box plot whiskers extend to two standard deviations, in order to exclude outliers.
  • Figure 3: Number of startup events per job (left y-axis) and number of jobs (right y-axis). Box plot whiskers extend to two standard deviations, in order to exclude outliers.
  • Figure 4: Breakdown of node-level startup overhead across different initialization stages. Box plot whiskers extend to two standard deviations, in order to exclude outliers.
  • Figure 5: The straggler effect increases with job scale. The y-axis shows the Max/Median ratio for each of a range of job scales. Box plot whiskers extend to two standard deviations, in order to exclude outliers.
  • ...and 8 more figures
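
The Max/Median ratio named in the Figure 5 caption, and the synchronization points marked (Sync) in Figure 1, can be illustrated with a small sketch. This is not code from the paper: the function names, the assumption that the ratio is taken over per-node startup times within one job, and the sample timings are all hypothetical and chosen only to show the arithmetic.

```python
from statistics import median


def straggler_ratio(node_startup_seconds):
    """Max/Median ratio of per-node startup times for one job.

    Assumption: the ratio is computed over the nodes of a single job.
    A value near 1.0 means nodes finish startup at about the same time;
    larger values mean the slowest node lags far behind the typical node.
    """
    if not node_startup_seconds:
        raise ValueError("need at least one node")
    med = median(node_startup_seconds)
    if med <= 0:
        raise ValueError("median startup time must be positive")
    return max(node_startup_seconds) / med


def synced_stage_duration(node_stage_seconds):
    """Duration of a synchronized stage as seen by the whole job.

    Assumption: at a (Sync) step every node waits for the slowest one,
    so the job-level duration is the maximum over nodes.
    """
    return max(node_stage_seconds)


# Hypothetical per-node startup times in seconds (not data from the paper).
example_nodes = [100.0, 110.0, 120.0, 130.0, 390.0]
example_ratio = straggler_ratio(example_nodes)
example_job_duration = synced_stage_duration(example_nodes)
```

With these made-up timings the median node needs 120 seconds while the slowest needs 390 seconds, so a single straggler sets the job-level duration of a synchronized stage even though most nodes finished much earlier.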