The DEEP-ER project: I/O and resiliency extensions for the Cluster-Booster architecture

Anke Kreuzer, Norbert Eicker, Jorge Amaya, Raphael Leger, Estela Suarez

TL;DR

The paper tackles Exascale I/O and resiliency challenges by extending the Cluster-Booster architecture with a multi-level memory hierarchy that includes non-volatile memory and Network Attached Memory (NAM). It presents a cohesive software stack (centered on SCR-based checkpointing, SIONlib, BeeGFS/BeeOND, and an OmpSs abstraction layer) designed to preserve portability while improving performance and fault tolerance. Co-design applications demonstrate improvements in I/O throughput, checkpoint overhead, and task resiliency, including NAM-based parity and advanced OmpSs resiliency features. Collectively, the work advances Modular Supercomputing concepts and informs future DEEP-EST platforms that integrate heterogeneous compute modules for scalable HPC and HPDA workloads.

Abstract

The recently completed research project DEEP-ER has developed a variety of hardware and software technologies to improve the I/O capabilities of next-generation high-performance computers and to enable applications to recover from the higher hardware failure rates expected on these machines. The heterogeneous Cluster-Booster architecture, first introduced in the predecessor DEEP project, has been extended by a multi-level memory hierarchy employing non-volatile and network-attached memory devices. Based on this hardware infrastructure, an I/O and resiliency software stack has been implemented that combines and extends well-established libraries and software tools while sticking to standard user interfaces. Real-world scientific codes have tested the project's developments and demonstrated the improvements achieved without compromising the portability of the applications.

Paper Structure

This paper contains 19 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Cluster-Booster architecture in DEEP-ER.
  • Figure 2: Network Attached Memory (NAM) board.
  • Figure 3: RMA benchmarks (bandwidth and latency) on the NAM. Best values achievable with Extoll are also plotted.
  • Figure 4: N-body code testing various checkpointing strategies on the DEEP-ER Cluster (weak scaling).
  • Figure 5: I/O improvement through SIONlib measured with GERShWIN.
  • ...and 5 more figures