Robust LLM Training Infrastructure at ByteDance

Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang

TL;DR

Large-scale LLM pretraining suffers frequent hardware and software faults that disrupt months-long runs. ByteRobust is a two-plane (control/data) infrastructure with automated fault tolerance. It combines real-time checks, hierarchical stop-time diagnostics, data-driven over-eviction, in-place hot-updates, warm standby backups, and eviction-aware checkpointing to maximize the effective training time ratio (ETTR). In production it achieves 97% ETTR for a three-month training job on 9,600 GPUs, cutting failure recovery time and supporting rapid iteration while the training code continues to evolve. The system also reports gains in model FLOPs utilization (MFU) and near-zero-overhead checkpointing, offering a practical path to reliable, continuous LLM training at industry scale.

Abstract

The training scale of large language models (LLMs) has reached tens of thousands of GPUs and continues to expand, enabling faster learning of larger models. As the resource scale grows, failures (CUDA errors, NaN values, job hangs, etc.) become prevalent, posing significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the unique characteristics of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelism and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance and prompt fault demarcation and localization with an effective data-driven approach, ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform and achieves a 97% effective training time ratio (ETTR) for a three-month training job on 9,600 GPUs.
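
To give a feel for what a 97% ETTR implies, here is a minimal sketch of the arithmetic. It assumes ETTR is the ratio of productive training time to total wall-clock time, and it approximates "three months" as 90 days; neither the function names nor the 90-day figure come from the paper.

```python
def ettr(productive_hours: float, wall_clock_hours: float) -> float:
    """Ratio of productive training time to wall-clock time (assumed definition)."""
    if wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    if not 0 <= productive_hours <= wall_clock_hours:
        raise ValueError("productive_hours must lie in [0, wall_clock_hours]")
    return productive_hours / wall_clock_hours


def unproductive_hours(wall_clock_hours: float, ettr_value: float) -> float:
    """Wall-clock hours not spent on productive training at a given ETTR."""
    if not 0 <= ettr_value <= 1:
        raise ValueError("ettr_value must lie in [0, 1]")
    return wall_clock_hours * (1.0 - ettr_value)


# Assumption: a three-month job is treated as 90 days of wall-clock time.
wall_clock = 90 * 24
lost = unproductive_hours(wall_clock, 0.97)
print(f"Unproductive time at 97% ETTR: {lost:.1f} h ({lost / 24:.1f} days)")
```

Under these assumptions, the job spends under three days out of ninety on detection, diagnosis, restart, and recomputation combined.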

Paper Structure

This paper contains 31 sections, 2 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Recipe of LLM pretraining. TextPT: Text Pretraining; MMCT: Multimodal Mixed Continual Training; ReasonCT: Reasoning Continual Training; LongCT: Long Context Continual Training; AnnealCT: Annealing Continual Training. Different LLMs may reorder stages [kimi-k2, llama3.1].
  • Figure 2: Normalized loss and relative MFU (ratio to the minimum MFU value) curves of an LLM training job running on 1000 GPUs in a production environment. Each color indicates one continuous, uninterrupted training period.
  • Figure 3: Unproductive time breakdown upon failures. Implicit failures, such as job hangs, are selected as examples since they typically result in prolonged unproductive times.
  • Figure 4: Architecture of ByteRobust.
  • Figure 5: The automated fault tolerance mechanism of ByteRobust.
  • ...and 7 more figures