Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs

Shengkun Cui, Archit Patke, Hung Nguyen, Aditya Ranjan, Ziheng Chen, Phuong Cao, Gregory Bauer, Brett Bode, Catello Di Martino, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

TL;DR

This resilience study of Delta, a large-scale AI/HPC cluster, analyzes A100 and H100 GPUs using 2.5 years of error data and maps memory, hardware, and NVLink failures to their impact on user jobs. It finds that H100 memory resilience is lower per GPU because of the higher memory capacity, while hardware resilience improves relative to A100, driven by tighter CPU-GPU integration and driver enhancements. Application-level recovery proves largely ineffective: GPU errors cause significant node downtime, and about 5% overprovisioning is recommended to maintain 99.9% job availability, albeit at substantial cost. The work offers a data-driven view of modern GPU reliability and informs design and operational strategies for resilient HPC/ML infrastructures.

Abstract

This study characterizes GPU resilience in Delta, a large-scale AI system that consists of 1,056 A100 and H100 GPUs, with over 1,300 petaflops of peak throughput. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors. Our major findings include: (i) H100 GPU memory resilience is worse than A100 GPU memory resilience, with 3.2x lower per-GPU mean time between errors (MTBE) for memory errors; (ii) the GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity; (iii) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components; (iv) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level; and (v) projecting the impact of GPU node availability to larger scales, we find that significant overprovisioning of 5% is necessary to handle GPU failures.
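To make the per-GPU MTBE comparison concrete, here is a minimal sketch of one common way to compute it: divide the operational GPU hours by the number of observed errors. The paper's exact accounting is not given in this summary, so the formula, the function names, and every number below are assumptions. The inputs are invented so that the ratio comes out to 3.2; they are not measurements from Delta.

```python
def per_gpu_mtbe(gpu_hours: float, error_count: int) -> float:
    """Per-GPU mean time between errors, in hours.

    Assumed definition: operational GPU-hours divided by the error count.
    """
    if error_count <= 0:
        raise ValueError("error_count must be positive")
    return gpu_hours / error_count


def system_mtbe(per_gpu_mtbe_hours: float, gpu_count: int) -> float:
    """System-wide MTBE if every GPU fails independently at the same rate."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return per_gpu_mtbe_hours / gpu_count


# Hypothetical inputs, chosen only for illustration (NOT data from the study).
a100_mtbe = per_gpu_mtbe(gpu_hours=8_000_000, error_count=100)
h100_mtbe = per_gpu_mtbe(gpu_hours=2_000_000, error_count=80)

# "N-times lower MTBE" means errors arrive N times more often per GPU.
ratio = a100_mtbe / h100_mtbe
print(f"A100: {a100_mtbe:.0f} h, H100: {h100_mtbe:.0f} h, ratio: {ratio:.1f}x")
```

Normalizing by GPU hours rather than wall-clock time keeps the comparison fair when the two GPU populations differ in size or time in service.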

Paper Structure

This paper contains 24 sections, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: A double-bit memory error (XID 48) occurred that the SECDED ECC of the HBM3 memory could not correct. As a result, the user job scheduled on that GPU failed, as reflected in the scheduler logs. The uncorrectable memory error then required the node to be drained and reset to complete the row-remapping recovery action. The total recovery process for this incident took 19 hours, during which the node could not accept new jobs. This incident shows that a GPU error can lead to user job failures and significantly impact node availability.
  • Figure 2: System architecture and specifications of Delta. This study focuses on the H100 and A100 GPU nodes.
  • Figure 3: NVIDIA memory error recovery process for A100 and H100 GPUs.
  • Figure 4: Overview of our data collection, processing, and analysis pipeline.
  • Figure 5: Intra-GPU uncorrectable memory error recovery paths in H100 GPUs. Numbers on the edges show propagation probabilities. Precise sub-second timing information is not available in the H100 nodes' system logs; all propagation times for H100 GPU memory errors were within one second.
  • ...and 5 more figures