Revisiting Reliability in Large-Scale Machine Learning Research Clusters

Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary DeVito, Shubho Sengupta, Kalyan Saladi, Carole-Jean Wu

TL;DR

This paper tackles reliability in large-scale, multi-tenant ML training clusters by presenting 11 months of fleet data across two state-of-the-art environments, introducing a formal failure taxonomy, and modeling Mean Time to Failure ($MTTF$) as a function of GPU scale. It develops an analytical estimator for Effective Training Time Ratio ($ETTR$) and demonstrates practical mitigations, including lemon-node detection and adaptive network routing, to boost training productivity. The contributions advance a workload-agnostic, reliability-aware approach to cluster design, fault detection, and recovery, with actionable guidance for reducing lost goodput and accelerating large-scale model training. The work highlights the asymmetric impact of reliability on large versus small jobs and emphasizes network and software co-design as essential for sustainable AI research at exascale-like scale.

Abstract

Reliability is a fundamental challenge in operating large-scale machine learning (ML) infrastructures, particularly as the scale of ML models and training clusters continues to grow. Despite decades of research on infrastructure failures, the impact of job failures across different scales remains unclear. This paper presents a view of managing two large, multi-tenant ML clusters, providing quantitative analysis, operational experience, and our own perspective on understanding and addressing reliability concerns at scale. Our analysis reveals that while large jobs are most vulnerable to failures, smaller jobs make up the majority of jobs in the clusters and should be incorporated into optimization objectives. We identify key workload properties, compare them across clusters, and demonstrate essential reliability requirements for pushing the boundaries of ML training at scale. We introduce a taxonomy of failures and key reliability metrics, and analyze 11 months of data from two state-of-the-art ML environments with 4 million jobs and over 150 million A100 GPU hours. Building on our data, we fit a failure model to project Mean Time to Failure for various GPU scales. We further propose a method to estimate a related metric, Effective Training Time Ratio, as a function of job parameters, and we use this model to gauge the efficacy of potential software mitigations at scale. Our work provides valuable insights and future research directions for improving the reliability of AI supercomputer clusters, emphasizing the need for flexible, workload-agnostic, and reliability-aware infrastructure, system software, and algorithms.
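
To make the scaling intuition behind the abstract concrete, the following is a minimal sketch, not the paper's fitted model or its ETTR estimator. It assumes failures are independent and arrive at a constant per-GPU rate, so that the MTTF of a job shrinks inversely with the number of GPUs it uses, and it uses a generic first-order approximation in which each failure costs a fixed restart overhead plus, on average, half a checkpoint interval of lost work. The function names, parameters, and the failure-rate value are all hypothetical and chosen only for illustration.

```python
def projected_mttf_hours(num_gpus: int, failures_per_gpu_hour: float) -> float:
    """Project job MTTF assuming independent, constant-rate per-GPU failures.

    Under that assumption the job-level failure rate is num_gpus * rate,
    so MTTF = 1 / (num_gpus * rate).
    """
    if num_gpus <= 0 or failures_per_gpu_hour <= 0:
        raise ValueError("num_gpus and failures_per_gpu_hour must be positive")
    return 1.0 / (num_gpus * failures_per_gpu_hour)


def approx_ettr(mttf_hours: float,
                checkpoint_interval_hours: float,
                restart_overhead_hours: float) -> float:
    """First-order approximation of the fraction of wallclock time that is productive.

    Each failure is assumed to lose the restart overhead plus, on average,
    half a checkpoint interval. This is an illustrative simplification and
    ignores queueing delays and failures during recovery.
    """
    lost_per_failure = restart_overhead_hours + checkpoint_interval_hours / 2.0
    ettr = 1.0 - lost_per_failure / mttf_hours
    # Clamp to [0, 1] because the approximation breaks down when MTTF is very short.
    return max(0.0, min(1.0, ettr))


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 1e-5  # failures per GPU-hour; illustrative, not measured
    for gpus in (8, 1024, 16384):
        mttf = projected_mttf_hours(gpus, HYPOTHETICAL_RATE)
        ettr = approx_ettr(mttf, checkpoint_interval_hours=1.0,
                           restart_overhead_hours=0.5)
        print(f"{gpus:>6} GPUs: MTTF ~ {mttf:.1f} h, approx ETTR ~ {ettr:.3f}")
```

Under these assumptions, the sketch shows why reliability affects large and small jobs differently: the same per-GPU failure rate yields a much shorter MTTF, and therefore a lower approximate ETTR, as the GPU count grows.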

Paper Structure

This paper contains 15 sections, 11 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: System Overview of the Research Cluster.
  • Figure 2: The Network Topology of RSC-1 (similar for RSC-2).
  • Figure 3: Scheduler Job Status Breakdown by Number of Jobs and GPU Runtime on RSC-1.
  • Figure 4: Attributed hardware failures on RSC-1 and RSC-2, expressed as a per-GPU hourly rate.
  • Figure 5: Evolution of cluster failure rate for RSC-1 and RSC-2, broken down by failure mode. Annotated vertical lines mark the dates on which various health checks were introduced over the course of the year.
  • ...and 7 more figures