SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation

Yifan Xiong, Yuting Jiang, Ziyue Yang, Lei Qu, Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, Jithin Jose, Hossein Pourreza, Jeff Baxter, Kushal Datta, Prabhat Ram, Luke Melton, Joe Chau, Peng Cheng, Yongqiang Xiong, Lidong Zhou

TL;DR

Cloud AI infrastructure suffers from gray failures: hardware redundancies can mask hidden degradation, which lowers end-to-end performance and hinders root-cause analysis. SuperBench tackles this with a proactive validation framework consisting of a diverse benchmark set, a Validator that runs the tests and pinpoints defective components, and a Selector that decides when to validate and which benchmarks to run based on data-driven incident probabilities, continually evolving with gathered data. In simulations and real Azure deployments, SuperBench increases the mean time between incidents (MTBI) by up to 22.61x and scales in practice, identifying defective nodes and preserving high cluster utilization while reducing validation costs. The work provides a data-driven, extensible approach to preemptively restoring redundancy health in cloud AI systems, with open-source benchmarks and production deployment as evidence of practicality and impact.

Abstract

Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently lead to hidden degradation, so-called "gray failure", for AI workloads, significantly affecting end-to-end performance and concealing performance issues, which complicates root cause analysis for failures and regressions. We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite, capable of evaluating individual hardware components and representing most real AI workloads. It comprises a Validator which learns benchmark criteria to clearly pinpoint defective components. Additionally, SuperBench incorporates a Selector to balance validation time and issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61x. SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs over the last two years.
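
To make the headline metric concrete, the following is a minimal sketch of how a mean time between incidents and its improvement ratio could be computed. It assumes the common definition (observed operating time divided by the number of incidents), which may differ from the exact definition used in the paper. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper's data.

```python
import math


def mean_time_between_incidents(observed_hours: float, incident_count: int) -> float:
    """Assumed definition: observed operating time divided by incident count.

    Returns infinity when no incident was observed in the window.
    """
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    if incident_count < 0:
        raise ValueError("incident_count must be non-negative")
    if incident_count == 0:
        return math.inf
    return observed_hours / incident_count


def mtbi_improvement(baseline_mtbi: float, validated_mtbi: float) -> float:
    """Ratio of MTBI with proactive validation to MTBI without it."""
    if baseline_mtbi <= 0:
        raise ValueError("baseline_mtbi must be positive")
    return validated_mtbi / baseline_mtbi


# Hypothetical numbers, for illustration only.
baseline = mean_time_between_incidents(observed_hours=1000.0, incident_count=10)
validated = mean_time_between_incidents(observed_hours=1000.0, incident_count=2)
ratio = mtbi_improvement(baseline, validated)
print(f"baseline={baseline} h, validated={validated} h, improvement={ratio}x")
```

Under this assumed definition, a reported gain such as "up to 22.61x" would mean that the same observation window contains roughly 22.61 times fewer incidents when validation is applied.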

Paper Structure

This paper contains 57 sections, 4 equations, 9 figures, 6 tables, 2 algorithms.

Figures (9)

  • Figure 1: Percentage of infrastructure incidents' sources.
  • Figure 2: Incidents troubleshooting duration distribution.
  • Figure 3: Cumulative distribution of 2-node all-reduce bandwidth from a 24-node testbed with different redundancy ratios.
  • Figure 4: Left: mean time between $i$th incidents for nodes. Right: time to failure for jobs if all nodes in the job have had their $i$th incident.
  • Figure 5: GPU job percentage for diverse workloads.
  • ...and 4 more figures