Scaling MPI Applications on Aurora

Huda Ibeid, Anthony-Trung Nguyen, Aditya Nishtala, Premanand Sakarda, Larry Kaplan, Nilakantan Mahadevan, Michael Woodacre, Victor Anisimov, Kalyan Kumaran, JaeHyuk Kwack, Vitali Morozov, Servesh Muralidharan, Scott Parker

TL;DR

Aurora demonstrates a comprehensive exascale HPC/AI platform built around the HPE Slingshot fabric and Intel Ponte Vecchio GPUs. The paper details the network architecture, topology, management, and rigorous multi-level validation that enable scalable, low-latency communication across thousands of nodes. Benchmark and application results (HPL, HPL-MxP, Graph500, HPCG, HACC, Nekbone, AMR-Wind, LAMMPS, FMM) illustrate exascale performance and strong scaling to large fractions of the system, validating practical impact for open science. Collectively, the work establishes Aurora as a leading exascale system with demonstrated throughput, latency, and bandwidth suitable for breakthrough scientific computations.

Abstract

The Aurora supercomputer, which was deployed at Argonne National Laboratory in 2024, is currently one of three Exascale machines in the world on the Top500 list. The Aurora system is composed of over ten thousand nodes, each of which contains six Intel Data Center Max Series GPUs, Intel's first data center-focused discrete GPU, and two Intel Xeon Max Series CPUs, Intel's first Xeon processor to contain HBM memory. To achieve Exascale performance, the system uses the HPE Slingshot high-performance fabric interconnect to connect the nodes. Aurora is the largest deployment of the Slingshot fabric to date, with nearly 85,000 Cassini NICs and 5,600 Rosetta switches connected in a dragonfly topology. The combination of the Intel-powered nodes and the Slingshot network enabled Aurora to become the second fastest system on the Top500 list in June of 2024 and the fastest system on the HPL-MxP benchmark. The system is one of the most powerful systems in the world dedicated to AI and HPC simulations for open science. This paper presents details of the Aurora system design, with a particular focus on the network fabric and the approach taken to validating it. The performance of the system is demonstrated through results of MPI benchmarks as well as performance benchmarks including HPL, HPL-MxP, Graph500, and HPCG run on a large fraction of the system. Additionally, results are presented for a diverse set of applications including HACC, AMR-Wind, LAMMPS, and FMM, demonstrating that Aurora provides the throughput, latency, and bandwidth across the system needed to allow applications to perform and scale to large node counts, providing new levels of capability and enabling breakthrough science.

Paper Structure

This paper contains 46 sections, 20 figures, 6 tables.

Figures (20)

  • Figure 1: Aurora Node
  • Figure 2: Aurora Fabric Topology
  • Figure 3: Aurora Fabric Validation
  • Figure 4: Fabric validation using an all2all communication benchmark across 9,658 nodes (77,264 NICs) with 16 processes per node (PPN=16)
  • Figure 5: Fabric Validation using GPCNet network test
  • ...and 15 more figures
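
The node, NIC, and PPN counts quoted in the Figure 4 caption imply a few derived quantities that are useful when reading the all2all result. The following is a minimal sketch of that arithmetic only. The function name `run_geometry` is hypothetical, and the even spread of ranks across NICs is an assumption of this sketch, not something the caption states.

```python
# Numbers quoted in the Figure 4 caption.
NODES = 9_658
NICS = 77_264
PPN = 16  # MPI processes per node


def run_geometry(nodes: int, nics: int, ppn: int) -> dict:
    """Derive basic run geometry from node, NIC, and PPN counts.

    Hypothetical helper for illustration; not taken from the paper.
    """
    nics_per_node, remainder = divmod(nics, nodes)
    total_ranks = nodes * ppn
    return {
        "nics_per_node": nics_per_node,
        "nics_divide_evenly": remainder == 0,
        "total_ranks": total_ranks,
        # Assumption: ranks on a node are spread evenly over its NICs.
        "ranks_per_nic": ppn / nics_per_node,
        # In an all-to-all, every rank sends to every other rank.
        "ordered_rank_pairs": total_ranks * (total_ranks - 1),
    }


if __name__ == "__main__":
    for key, value in run_geometry(NODES, NICS, PPN).items():
        print(f"{key}: {value}")
```

With the caption's numbers, this gives 8 NICs per node and 154,528 MPI ranks in total, so the number of ordered sender/receiver pairs in the all2all is on the order of 10^10.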