The DEEP-ER project: I/O and resiliency extensions for the Cluster-Booster architecture
Anke Kreuzer, Norbert Eicker, Jorge Amaya, Raphael Leger, Estela Suarez
TL;DR
The paper tackles Exascale I/O and resiliency challenges by extending the Cluster-Booster architecture with a multi-level memory hierarchy that includes non-volatile memory and Network Attached Memory (NAM). It presents a software stack, built on SCR-based checkpointing, SIONlib, BeeGFS/BeeOND, and an OmpSs abstraction layer, that is designed to preserve application portability while improving performance and fault tolerance. Co-design applications demonstrate improvements in I/O throughput, checkpoint overhead, and task resiliency, including NAM-based parity and advanced OmpSs resiliency features. Together, these results advance the Modular Supercomputing concept and inform future DEEP-EST platforms that integrate heterogeneous compute modules for scalable HPC and HPDA workloads.
Abstract
The recently completed research project DEEP-ER has developed a variety of hardware and software technologies to improve the I/O capabilities of next-generation high-performance computers and to enable applications to recover from the higher hardware failure rates expected on these machines. The heterogeneous Cluster-Booster architecture, first introduced in the predecessor DEEP project, has been extended by a multi-level memory hierarchy employing non-volatile and network-attached memory devices. On top of this hardware infrastructure, an I/O and resiliency software stack has been implemented that combines and extends well-established libraries and software tools while adhering to standard user interfaces. Real-world scientific codes have tested the project's developments and demonstrated the improvements achieved without compromising the portability of the applications.
