FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training

Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang

TL;DR

FALCON is a framework that rapidly identifies fail-slowed GPUs and/or communication links and effectively tackles them with a novel multi-level mitigation mechanism, all without human intervention.

Abstract

Fail-slows, or stragglers, are common but largely unheeded problems in large-scale hybrid-parallel training that spans thousands of GPU servers and runs for weeks to months. Yet these problems are not well studied, nor can they be quickly detected and effectively mitigated. In this paper, we first present a characterization study on a shared production cluster with over 10,000 GPUs. We find that fail-slows are caused by various CPU/GPU computation and cross-node networking issues, last from tens of seconds to nearly ten hours, and collectively delay the average job completion time by 1.34%. The current practice is to manually detect these fail-slows and simply treat them as fail-stops using a checkpoint-and-restart failover approach, which is labor-intensive and time-consuming. We then propose FALCON, a framework that rapidly identifies fail-slowed GPUs and/or communication links, and effectively tackles them with a novel multi-level mitigation mechanism, all without human intervention. We have applied FALCON to detect human-labeled fail-slows in a production cluster with over 99% accuracy. Cluster deployment further demonstrates that FALCON effectively handles manually injected fail-slows, mitigating the training slowdown by 60.1%.

Paper Structure

This paper contains 27 sections, 13 equations, 20 figures, 7 tables, 1 algorithm.

Figures (20)

  • Figure 1: Left: Occurrence rate of fail-slows on computation and communication at node or link level and in large-scale training. Center: Impact of fail-slows on job completion time (JCT). Right: CDF of fail-slow duration.
  • Figure 2: A case of a fail-slow job due to CPU contention. Upper-left: Training throughput. Upper-right: GPU SM utilization of the four GPUs used by this job. Bottom-left: The number of high-CPU jobs running on the same node. Bottom-right: CPU satisfaction rate of the training job (red) and other colocated jobs (blue).
  • Figure 3: A case of a fail-slow job due to GPU performance degradation. Upper-left: Throughput of the training job. Upper-right: GPU SM utilization of the four GPUs used. Bottom-left: Normalized GPU performance during fail-slow. Bottom-right: Reported GPU temperature.
  • Figure 4: A case of fail-slow jobs caused by network congestion. Left: Training throughput. Center: The number of congestion notification packets ($\times 1000$) sent by NICs. Right: Average GPU SM utilization of the 8 GPUs used by the job.
  • Figure 5: Two 1024-GPU jobs that failed slow due to network congestion. Left: An LLM training job. Right: An MoE training job with high variance and ladder-shaped fail-slow.
  • ...and 15 more figures