Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach

Yao Xu; Gene Cooperman

TL;DR

The paper tackles practical, transparent checkpointing of MPI on modern interconnects by eliminating the high runtime overhead of prior approaches. It introduces the Collective Clock (CC) algorithm, a topological-sort style method that uses per-group clocks $SEQ[ggid]$ and $TARGET[ggid]$ to determine a safe checkpoint point without inserting global barriers, and it extends support to non-blocking collectives. Micro-benchmarks and real-world tests on Perlmutter, including VASP workloads, show runtime overhead typically between $0\%$ and $5\%$, a large improvement over the previous 2PC approach. The work demonstrates that CC enables practical resilience for HPC workloads with frequent collective operations and high-rate networks, and the implementation is open source for broad adoption.
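The role of the two per-group clocks can be illustrated with a minimal sketch. It rests on assumptions not stated in this summary: that $SEQ[ggid]$ counts the collective calls a rank has entered on group $ggid$, and that $TARGET[ggid]$ is taken as the largest $SEQ[ggid]$ on any rank when the checkpoint request arrives. The function names, group ids, and snapshot values are hypothetical, and the paper's full rules (such as its Condition $A$) are not reproduced.

```python
from typing import Dict, List

# ggid -> number of collectives this rank has entered on that group (assumed meaning of SEQ)
Seq = Dict[str, int]


def compute_targets(seq_by_rank: Dict[int, Seq]) -> Dict[str, int]:
    """Assumed rule: TARGET[ggid] is the largest SEQ[ggid] seen on any rank."""
    target: Dict[str, int] = {}
    for seq in seq_by_rank.values():
        for ggid, count in seq.items():
            target[ggid] = max(target.get(ggid, 0), count)
    return target


def lagging_ranks(
    seq_by_rank: Dict[int, Seq], target: Dict[str, int]
) -> Dict[int, List[str]]:
    """Ranks that must keep running, with the groups on which they are behind."""
    behind: Dict[int, List[str]] = {}
    for rank, seq in seq_by_rank.items():
        groups = sorted(g for g, count in seq.items() if count < target[g])
        if groups:
            behind[rank] = groups
    return behind


# Hypothetical snapshot at the moment a checkpoint request arrives.
seq_by_rank = {
    0: {"world": 3, "row": 2},
    1: {"world": 3, "row": 1},
    2: {"world": 2},
}
target = compute_targets(seq_by_rank)
behind = lagging_ranks(seq_by_rank, target)
print(target)
print(behind)
```

In this snapshot, rank 0 has already reached every target, while ranks 1 and 2 would continue executing until their counters catch up. No global barrier is involved in that decision, only a comparison of per-group counters.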

Abstract

MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component in any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attractive due to its generality and ease of use for most MPI application developers. Earlier efforts at transparent checkpointing for MPI, one decade ago, suffered from two difficult problems: (i) reliance on a specific MPI implementation tied to a specific network technology; and (ii) failure to demonstrate sufficiently low runtime overhead. Problem (i) (network dependence) was already solved in 2019 by MANA's introduction of split processes. Problem (ii) (runtime overhead) is solved in this work. This paper introduces an approach that avoids these limitations, employing a novel topological sort to algorithmically determine a safe future synchronization point. The algorithm is valid for both blocking and non-blocking collective communication in MPI. We demonstrate the efficacy and scalability of our approach through both micro-benchmarks and a set of five real-world MPI applications, notably including the widely used VASP (Vienna Ab Initio Simulation Package), which is responsible for 11% of the workload on the Perlmutter supercomputer at Lawrence Berkeley National Laboratory. VASP was previously cited as a special challenge for checkpointing, in part due to its multi-algorithm codes.

Paper Structure (27 sections, 9 figures, 1 table, 3 algorithms)

This paper contains 27 sections, 9 figures, 1 table, 3 algorithms.

Introduction
Points of Novelty
Organization of Paper
Background
Review of MPI
MANA's Split Process Software Architecture and Checkpointing
A Close Look at the MPI Standard
Collective Clock (CC) Algorithm
Definitions
The CC Algorithm for Blocking Collective calls
CC algorithm at runtime
CC algorithm at checkpoint time
The CC pseudo-code (blocking collective calls)
CC at checkpoint time: An example
Blocking collective calls and point-to-point calls
...and 12 more sections

Figures (9)

Figure 1: Split Process Architecture
Figure 2: Examples of the CC algorithm at checkpoint time. The execution of MPI communications is viewed as a directed graph. Each node corresponds to a collective communication. Each edge is labeled by an MPI process participating in the communication. Solid incoming edges indicate processes that already visited the node, whereas dotted incoming edges indicate future executions. In Figure (a), Condition $A$ is applied once for $P2$ to continue executing. In Figure (b), $P2$ discovers the intermediate node $N5$, and so Condition $A$ is applied twice for $P2$ and once for $P4$.
Figure 3: The two figures are examples of snapshots in time. An arrowhead in the timeline of an MPI process in Figure \ref{fig:seq-num-algo-simple} indicates the current point in time at which the checkpoint request arrived. A solid vertical line is in the past and a dashed vertical line is in the future of the MPI process. A dashed vertical line terminates at the collective operation that is a target for the given process. A horizontal line indicates a (blocking) collective operation (a node when viewing this as a directed graph). The number to the right of each collective operation is the sequence number assigned for that ggid (for the set of ranks of the group of that operation).
Figure 4: The two cases above do not occur in a correct MPI program.
Figure 5: Runtime overhead on Micro-Benchmarks for CC and 2PC. Note that 2PC is not shown for non-blocking functions since 2PC does not support such calls.
...and 4 more figures
