Unicron: Economizing Self-Healing LLM Training at Scale

Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou

TL;DR

Unicron tackles the costly downtime and inefficiency of failure recovery in large-scale LLM training on cloud clusters. By integrating in-band error detection, a cost-aware dynamic plan generator, and an efficient transition strategy within a Megatron-based workflow, it optimizes the allocation of GPU resources across multiple concurrent tasks to minimize failure-related costs. The approach uses a weighted aggregate FLOP/s metric (WAF) and dynamic programming to compute optimal reconfiguration plans, while reusing partial results to minimize downtime during transitions. Empirical results on a 128-GPU cluster show up to 1.9x improvements in overall training efficiency and substantial savings in failure recovery costs, validating Unicron’s practical impact for resilient, scalable LLM training.

Abstract

Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.

Paper Structure (27 sections, 7 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Distribution of task termination statistics.
  • Figure 2: Training process with manual failure recovery.
  • Figure 3: Throughput and FLOP/s reduction when training the GPT-3 7B model on a cluster of 64 GPUs for 7 days, with 10 node faults occurring during the period. (a) Throughput is the number of samples the system can process per second. (b) The theoretical reduction is the share of hardware resources that are unavailable during failures. For each system, the reduction is the percentage of FLOP/s lost relative to the ideal FLOP/s it could achieve if no failures occurred.
  • Figure 4: Achieved FLOP/s ratio and aggregate FLOP/s for training varying-sized GPT-3 models using Megatron.
  • Figure 5: The system architecture of Unicron.
  • ...and 6 more figures