Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs
Shengkun Cui, Archit Patke, Hung Nguyen, Aditya Ranjan, Ziheng Chen, Phuong Cao, Gregory Bauer, Brett Bode, Catello Di Martino, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
TL;DR
This resilience study of Delta, a large-scale AI/HPC cluster, analyzes A100 and H100 GPUs using 2.5 years of error data and maps memory, hardware, and NVLink errors to their impact on user jobs. It finds that H100 memory resilience is lower per GPU due to higher memory capacity, while H100 hardware resilience improves relative to A100, driven by tighter CPU-GPU integration and driver enhancements. Application-level recovery proves largely ineffective, and GPU failures cause significant node downtime. The study recommends about 5% overprovisioning to maintain 99.9% job availability, albeit at substantial cost. The work offers a data-driven view of modern GPU reliability and informs design and operational strategies for resilient HPC/ML infrastructures.
Abstract
This study characterizes GPU resilience in Delta, a large-scale AI system that consists of 1,056 A100 and H100 GPUs with over 1,300 petaflops of peak throughput. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors. Our major findings are: (i) H100 GPU memory resilience is worse than A100 GPU memory resilience, with a 3.2x lower per-GPU mean time between errors (MTBE) for memory errors; (ii) the GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity; (iii) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components; (iv) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level; and (v) projecting the impact of GPU node availability to larger scales, we find that significant overprovisioning of 5% is necessary to handle GPU failures.
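To make the per-GPU MTBE comparison in finding (i) concrete, the following is a minimal sketch of how such a ratio could be computed from error counts. It assumes the common definition of per-GPU MTBE as accumulated GPU hours divided by the number of errors; the function names and all numbers are hypothetical illustrations, not values or code from the study.

```python
def system_mtbe(observation_hours: float, error_count: int) -> float:
    """Mean time between errors seen anywhere in the system, in hours."""
    if error_count <= 0:
        raise ValueError("error_count must be positive")
    return observation_hours / error_count


def per_gpu_mtbe(observation_hours: float, gpu_count: int, error_count: int) -> float:
    """Mean time between errors normalized per GPU, in GPU hours.

    Assumed definition: total GPU hours (observation hours x GPU count)
    divided by the number of errors observed across those GPUs.
    """
    if error_count <= 0:
        raise ValueError("error_count must be positive")
    return observation_hours * gpu_count / error_count


# Hypothetical fleets, chosen only to illustrate the arithmetic.
fleet_a = per_gpu_mtbe(observation_hours=1000.0, gpu_count=100, error_count=10)
fleet_b = per_gpu_mtbe(observation_hours=1000.0, gpu_count=100, error_count=32)

# A ratio above 1 means fleet B sees errors more often per GPU than fleet A.
mtbe_ratio = fleet_a / fleet_b

if __name__ == "__main__":
    print(f"fleet A per-GPU MTBE: {fleet_a:.0f} GPU hours")
    print(f"fleet B per-GPU MTBE: {fleet_b:.0f} GPU hours")
    print(f"fleet B MTBE is {mtbe_ratio:.1f}x lower than fleet A")
```

Normalizing by GPU count matters when the two fleets differ in size: a system-level MTBE alone would penalize the larger fleet simply for having more devices.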
