Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC

Madan Timalsina; Lisa Gerhardt; Nicholas Tyler; Johannes P. Blaschke; William Arndt

TL;DR

This work evaluates DMTCP-based checkpoint-restart (C/R) in HPC environments, focusing on containerized and non-containerized workflows on NERSC Perlmutter. It analyzes how HPC container runtimes (Shifter and Podman-HPC) interact with DMTCP, and presents automated and manual C/R strategies, including integration with SLURM. Results from Geant4-based simulations across versions and configurations show that C/R lets jobs resume from checkpoints with modest overhead, reducing the compute time lost to preemption. The study advances practical HPC methodologies by demonstrating robust, container-friendly C/R workflows that improve reliability and scheduling efficiency across diverse scientific domains.
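As an illustration of the resume-from-checkpoint logic summarized above, the following minimal Python sketch chooses between starting an application under DMTCP and restarting it from existing checkpoint images. It is not taken from the paper. The function name build_dmtcp_command, the 300-second default interval, and the example application ./my_simulation are hypothetical; the flags (--ckptdir, --interval) and the ckpt_*.dmtcp image naming follow DMTCP's command-line help as we understand it and should be checked against the installed version.

```python
import glob
import os
import tempfile


def build_dmtcp_command(ckpt_dir, app_argv, interval_s=300):
    """Return the argv list for a fresh DMTCP launch or for a restart.

    Assumption: DMTCP writes one image per process, named ckpt_*.dmtcp,
    into ckpt_dir. If any images are present we restart from all of them;
    otherwise we launch the application under DMTCP for the first time.
    Stale images from an unrelated run would be picked up too, so a real
    script should keep one checkpoint directory per job.
    """
    images = sorted(glob.glob(os.path.join(ckpt_dir, "ckpt_*.dmtcp")))
    if images:
        # Resume: dmtcp_restart takes the checkpoint images as arguments.
        return ["dmtcp_restart", "--interval", str(interval_s)] + images
    # First run: dmtcp_launch wraps the (hypothetical) application command.
    return (
        ["dmtcp_launch", "--ckptdir", ckpt_dir, "--interval", str(interval_s)]
        + list(app_argv)
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        # No images yet, so this prints a dmtcp_launch command.
        print(build_dmtcp_command(ckpt_dir, ["./my_simulation"]))
        # Fake an image to show the restart branch (empty placeholder file).
        open(os.path.join(ckpt_dir, "ckpt_my_simulation_0.dmtcp"), "w").close()
        print(build_dmtcp_command(ckpt_dir, ["./my_simulation"]))
```

The function only builds the command; a batch script would pass the result to subprocess.run or an equivalent, inside or outside a container.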

Abstract

This paper presents an in-depth examination of checkpoint-restart mechanisms in High-Performance Computing (HPC). It focuses on the use of Distributed MultiThreaded CheckPointing (DMTCP) in various computational settings, including both within and outside of containers. The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of checkpoint-restart (C/R) in managing complex and lengthy computations in HPC, highlighting its efficiency and reliability in such environments. The role of DMTCP in enhancing these workflows, especially in multi-threaded and distributed applications, is thoroughly explored. Additionally, the paper delves into the use of HPC containers, such as Shifter and Podman-HPC, which aid in the management of computational tasks, ensuring uniform performance across different environments. The methods, results, and potential future directions of this research, including its application in various scientific domains, are also covered, showcasing the critical advancements made in computational methodologies through this study.

Paper Structure (17 sections, 4 figures)

Introduction
Checkpoint-Restart
Distributed MultiThreaded CheckPointing (DMTCP)
How does DMTCP work?
National Energy Research Scientific Computing Center (NERSC)’s HPC Containers
Shifter
Podman-hpc
Performance Benchmarking of HPC Containers at NERSC
Methods
On NERSC Perlmutter
At NERSC Perlmutter inside the Containers
Automated C/R Strategies
Manual C/R Strategies
Results
Future Directions
...and 2 more sections

Figures (4)

Figure 1: Diagram illustrating the Distributed MultiThreaded CheckPointing (DMTCP) system with a central coordinator managing checkpoint messages (CKPT MSG) with three user processes. Each process contains a checkpoint thread (CKPT Thread) and user threads (Thread a/b/c/d/e/f), interconnected via socket connections. Signals (SIGTERM) are also shown, indicating the communication between threads and the checkpointing mechanism. Upon receiving a CKPT MSG from the central coordinator, the checkpoint threads trigger a signal to user threads, and a checkpointing action is initiated, which involves saving the current state of the processes.
Figure 2: Mean execution time of "from mpi4py import MPI" as a function of the number of MPI ranks and the location of the Python environment (based on benchmark from 9651304). Lines represent the mean over multiple runs and ranks. This benchmark was collected on a Perlmutter CPU node, with up to 128 ranks per node. Correspondingly, import times increase rapidly at around 128 ranks. Colored lines represent the different file systems on which the Python environment is located. The "NERSC module" is installed to /global/common/software, which is optimized to allow for highly parallel loading and linking of shared libraries. shifter and podman-hpc correspond to the two container runtime environments available on NERSC's Perlmutter system. podman-hpc's performance at scale is comparable with the highly optimized file systems (HOME, SCRATCH, and /global/common/software), whereas shifter outperforms all others.
Figure 3: Operational workflow of automated job management in the NERSC containerized HPC environments. This figure delineates the end-to-end process flow within a containerized HPC environment, encapsulating job submission, execution, checkpointing, signal trapping, and the conditional logic for job restarting or completion. It serves as an illustrative guide to the DMTCP-enabled job resubmission mechanism.
Figure 4: Comparative analysis of memory and CPU utilization over time on NERSC Perlmutter for computational processes run with three strategies inside a Shifter container: without checkpoint-restart (top), checkpoint-only (middle), and with checkpoint-restart (bottom).
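
The workflow in Figure 3 (trap a signal, checkpoint, then restart or finish) can be sketched in a few lines of Python. This is a simplified stand-in under stated assumptions, not the paper's scripts: it saves a step counter to a JSON file instead of writing a DMTCP checkpoint image, a loop plays the role of SLURM resubmission, and signal delivery is simulated by calling the handler directly. All names (PreemptionFlag, run_job, run_until_complete, preempt_at) are hypothetical.

```python
import json
import os
import signal
import tempfile


class PreemptionFlag:
    """Records that a warning signal arrived; the work loop polls it."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Signal handlers should do as little as possible: just set a flag.
        self.requested = True


def run_job(state_path, total_steps, flag, on_step=None):
    """Run one 'submission'. Returns 'completed' or 'requeue'."""
    step = 0
    if os.path.exists(state_path):
        # Restart branch: resume from the saved state.
        with open(state_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        if flag.requested:
            # Checkpoint branch: save progress and ask to be resubmitted.
            with open(state_path, "w") as f:
                json.dump({"step": step}, f)
            return "requeue"
        step += 1  # one unit of stand-in work
        if on_step is not None:
            on_step(step)
    if os.path.exists(state_path):
        os.remove(state_path)  # finished: the saved state is no longer needed
    return "completed"


def run_until_complete(state_path, total_steps, preempt_at=()):
    """Resubmit until the job completes; return the number of submissions.

    preempt_at lists the steps at which a signal is simulated, one per
    submission, in order.
    """
    pending = list(preempt_at)
    submissions = 0
    flag = PreemptionFlag()
    previous = signal.signal(signal.SIGTERM, flag.handler)  # trap the signal
    try:
        while True:
            submissions += 1
            flag.requested = False
            target = pending.pop(0) if pending else None

            def on_step(step, target=target):
                if step == target:
                    # Simulated delivery; a scheduler would send the signal.
                    flag.handler(signal.SIGTERM, None)

            if run_job(state_path, total_steps, flag, on_step) == "completed":
                return submissions
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)  # restore old handler


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "state.json")
        print(run_until_complete(path, total_steps=10, preempt_at=(3, 7)))
```

With simulated preemptions at steps 3 and 7, the job needs three submissions: two that checkpoint and request a requeue, and a third that resumes from step 7 and completes.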
