A Digital Twin Framework for Liquid-cooled Supercomputers as Demonstrated at Exascale

Wesley Brewer, Matthias Maiterth, Vineet Kumar, Rafal Wojda, Sedrick Bouknight, Jesse Hines, Woong Shin, Scott Greenwood, David Grant, Wesley Williams, Feiyi Wang

TL;DR

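This paper presents ExaDigiT, an open-source framework for building digital twins of liquid-cooled supercomputers, and validates it by replaying six months of telemetry from Frontier. The authors share lessons learned for HPC practitioners developing similar digital twins and argue that such twins will be a key enabler for sustainable, energy-efficient supercomputing.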

Abstract

We present ExaDigiT, an open-source framework for developing comprehensive digital twins of liquid-cooled supercomputers. It integrates three main modules: (1) a resource allocator and power simulator, (2) a transient thermo-fluidic cooling model, and (3) an augmented reality model of the supercomputer and central energy plant. The framework enables the study of "what-if" scenarios, system optimizations, and virtual prototyping of future systems. Using Frontier as a case study, we demonstrate the framework's capabilities by replaying six months of system telemetry for systematic verification and validation. Such a comprehensive analysis of a liquid-cooled exascale supercomputer is the first of its kind. ExaDigiT elucidates complex transient cooling system dynamics, runs synthetic or real workloads, and predicts energy losses due to rectification and voltage conversion. Throughout our paper, we present lessons learned to benefit HPC practitioners developing similar digital twins. We envision the digital twin will be a key enabler for sustainable, energy-efficient supercomputing.

Paper Structure (28 sections, 7 equations, 9 figures, 4 tables, 1 algorithm)


Figures (9)

  • Figure 1: Architectural overview.
  • Figure 2: Relationships between levels.
  • Figure 3: Frontier rack-level power distribution and voltage conversion.
  • Figure 4: Frontier power utilization breakdown based on peak CPU/GPU utilization of its 9472 nodes.
  • Figure 5: Simplified schematic of Frontier cooling system with enumerated locations where the cooling model predicts pressures, temperatures, and flow rates.
  • ...and 4 more figures