Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC
Madan Timalsina, Lisa Gerhardt, Nicholas Tyler, Johannes P. Blaschke, William Arndt
TL;DR
This work evaluates DMTCP-based checkpoint-restart (C/R) for both containerized and non-containerized workflows on NERSC Perlmutter. It analyzes how the HPC container runtimes Shifter and Podman-HPC interact with DMTCP, and presents automated and manual C/R strategies, including integration with SLURM. Results from Geant4-based simulations across versions and configurations show that C/R lets jobs resume from checkpoints with modest overhead, reducing the compute time lost to preemption. The study demonstrates robust, container-friendly C/R workflows that improve reliability and scheduling efficiency and that apply across diverse scientific domains.
Abstract
This paper presents an in-depth examination of checkpoint-restart mechanisms in High-Performance Computing (HPC). It focuses on the use of Distributed MultiThreaded CheckPointing (DMTCP) both inside and outside containers. The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of checkpoint-restart (C/R) for managing complex, long-running computations in HPC, highlighting its efficiency and reliability in such environments. We examine the role of DMTCP in enhancing these workflows, especially for multi-threaded and distributed applications. We also examine HPC container runtimes such as Shifter and Podman-HPC, which help manage computational tasks and ensure uniform performance across different environments. Finally, we cover the methods, results, and potential future directions of this research, including its application in various scientific domains.
