Minder: Faulty Machine Detection for Large-scale Distributed Model Training

Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, Fuliang Li, Shuguang Wang, Haibin Lin, Jianxi Ye, Minlan Yu

TL;DR

Minder detects faulty machines during large-scale distributed model training, where a single fault can halt a task for hours and diagnosis is labor-intensive. It is an unsupervised framework built on four ideas: machine-level similarity, fault continuity, per-metric denoising with an LSTM-VAE, and prioritized metrics. Together these enable rapid, accurate detection without interrupting training. Deployed in production, Minder reacts to faults within 3.6 seconds on average and shows strong accuracy (precision 0.904, F1-score 0.893) relative to a Mahalanobis-distance baseline, and it handles multiple fault types with a scalable, modular design. The work suggests directions for broader applicability, richer metrics, and enhanced root-cause analysis.

Abstract

Large-scale distributed model training requires simultaneous training on up to thousands of machines. Detecting the faulty machine is critical when an unexpected fault occurs. From our experience, a training task can encounter two faults per day on average, each possibly leading to a halt of several hours. To address the drawbacks of time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect the distinctive monitoring-metric patterns of a faulty machine, which can persist for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks that each involve up to thousands of machines. In our real-world fault detection scenarios, Minder reacts to faults within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893.
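
To make the two core ideas named above (machine-level similarity and fault continuity) concrete, here is a minimal Python sketch. It is not Minder's implementation: it skips the LSTM-VAE denoising and metric prioritization entirely, works on a single raw metric, and the function names, window size, continuity count, and z-score threshold are all hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def window_outlier(windows, z_threshold=2.0):
    """Return the index of the most dissimilar machine in one window, or None.

    `windows` holds one equal-length list of metric samples per machine.
    Assumption: dissimilarity is the sum of Euclidean distances to all peers.
    """
    n = len(windows)
    dist_sums = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i != j:
                total += sqrt(sum((a - b) ** 2
                                  for a, b in zip(windows[i], windows[j])))
        dist_sums.append(total)
    spread = pstdev(dist_sums)
    if spread == 0:
        return None  # every machine looks the same in this window
    candidate = max(range(n), key=dist_sums.__getitem__)
    z = (dist_sums[candidate] - mean(dist_sums)) / spread
    return candidate if z > z_threshold else None


def detect_faulty_machine(series, window=4, continuity=3, z_threshold=2.0):
    """Report a machine only if it is the outlier in `continuity` windows in a row.

    `series` holds one list of samples per machine for a single metric.
    Windows are non-overlapping here purely to keep the sketch short.
    """
    length = min(len(s) for s in series)
    streak_machine, streak = None, 0
    for start in range(0, length - window + 1, window):
        flagged = window_outlier(
            [s[start:start + window] for s in series], z_threshold)
        if flagged is not None and flagged == streak_machine:
            streak += 1
        else:
            # New candidate (or none): restart the continuity count.
            streak_machine = flagged
            streak = 1 if flagged is not None else 0
        if streak >= continuity:
            return streak_machine
    return None
```

Calling `detect_faulty_machine` with one sample list per machine returns the index of a machine whose metric stays unlike its peers for three consecutive windows, and `None` for a deviation that appears in only one window. One caveat of this particular scoring choice: with a population z-score, a single outlier among n machines can score at most sqrt(n - 1), so the default threshold of 2.0 can only trigger with six or more machines.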

Paper Structure

This paper contains 29 sections, 1 equation, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Fault frequency of tasks with different machine scale sizes.
  • Figure 2: Time for task diagnosis in seven months.
  • Figure 3: PFC tx packet rate pattern for each machine before and after a fault occurs.
  • Figure 4: Duration of abnormal performance following a fault.
  • Figure 5: System architecture of Minder.
  • ...and 11 more figures