Aurora: Architecting Argonne's First Exascale Supercomputer for Accelerated Scientific Discovery

William E. Allcock, Benjamin S. Allen, James Anchell, Victor Anisimov, Thomas Applencourt, Abhishek Bagusetty, Ramesh Balakrishnan, Riccardo Balin, Solomon Bekele, Colleen Bertoni, Cyrus Blackworth, Renzo Bustamante, Kevin Canada, John Carrier, Christopher Chan-nui, Lance C. Cheney, Taylor Childers, Paul Coffman, Susan Coghlan, Tanima Dey, Michael D'Mello, Ashok Emani, Murali Emani, Kyle G. Felker, Sam Foreman, Olivier Franza, Longfei Gao, Marta García, María Garzarán, Balazs Gerofi, Yasaman Ghadar, Subrata Goswami, Neha Gupta, Kevin Harms, Väinö Hatanpää, Brian Holland, Carissa Holohan, Brian Homerding, Khalid Hossain, Xue Hu, Louise Huot, Huda Ibeid, Joseph A. Insley, Sai Jayanthi, Hong Jiang, Wei Jiang, Xiao-Yong Jin, Jeongnim Kim, Christopher Knight, Panagiotis Kourdis, Kalyan Kumaran, JaeHyuk Kwack, Janghaeng Lee, Ti Leggett, Ben Lenard, Chris Lewis, Nevin Liber, Johann Lombardi, Raymond M. Loy, Ye Luo, Bethany Lusch, Nilakantan Mahadevan, Beth Markey, Victor A. Mateevitsi, Gordon McPheeters, Ryan Milner, Jerome Mitchell, Vitali A. Morozov, Servesh Muralidharan, Tom Musta, Mrigendra Nagar, Vikram Narayana, Marieme Ngom, Anthony-Trung Nguyen, Nathan Nichols, Aditya Nishtala, James C. Osborn, Michael E. Papka, Scott Parker, Saumil S. Patel, Julia Piotrowska, Adrian C. Pope, Sucheta Raghunanda, Esteban Rangel, Paul M. Rich, Katherine M. Riley, Silvio Rizzi, Kris Rowe, Varuni Sastry, Adam Scovel, Filippo Simini, Haritha Siddabathuni Som, Patrick Steinbrecher, Rick Stevens, Xinmin Tian, Peter Upton, Thomas Uram, Archit K. Vasan, Álvaro Vázquez-Mayagoitia, Kaushik Velusamy, Brice Videau, Venkatram Vishwanath, Brian Whitney, Timothy J. Williams, Michael Woodacre, Sam Zeltner, Chuanjun Zhang, Gengbin Zheng, Huihuo Zheng

TL;DR

This paper outlines the design of Aurora, Argonne's exascale system, which integrates Intel Xeon Max (Sapphire Rapids) CPUs, Ponte Vecchio GPUs, the Slingshot-11 interconnect, and the DAOS storage stack to accelerate HPC, data analytics, and AI. It details the Exascale Compute Blade (ECB) architecture, the dragonfly-based scale-out topology, and a software ecosystem built on oneAPI with SYCL, OpenMP, and OpenCL, experimental HIP support, and portability layers (Kokkos/RAJA). The paper presents node-level and at-scale benchmark results (HPL, HPL-MxP, MPICH, oneCCL) and early application performance, and discusses reliability, fault-tolerance, and power-management challenges. The integration of DAOS with Lustre, together with AI and data tooling, positions Aurora to support large-scale, data-driven discovery across multiple scientific pillars. Overall, the reported results indicate that the system scales and that its software ecosystem supports broad, portable workloads.

Abstract

Aurora is Argonne National Laboratory's pioneering Exascale supercomputer, designed to accelerate scientific discovery with cutting-edge architectural innovations. Key new technologies include the Intel(TM) Xeon(TM) CPU Max Series (code-named Sapphire Rapids) with support for High Bandwidth Memory (HBM), alongside the Intel(TM) Data Center GPU Max Series (code-named Ponte Vecchio) on each compute node. Aurora also integrates the Distributed Asynchronous Object Storage (DAOS), a novel exascale storage solution, and leverages Intel's oneAPI programming environment. This paper presents an in-depth exploration of Aurora's node architecture, the HPE Slingshot interconnect, the supporting software ecosystem, and DAOS. We provide insights into standard benchmark performance and application readiness efforts via Aurora's Early Science Program and the Exascale Computing Project.

Paper Structure

This paper contains 43 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Aurora's first row of cabinets at the ALCF.
  • Figure 2: Intel® Xeon Max Series CPU with HBM.
  • Figure 3: Intel® Data Center GPU Max.
  • Figure 4: Aurora Exascale Compute Blade (ECB).
  • Figure 5: Physical Aurora Exascale Compute Blade (ECB).
  • ...and 5 more figures