Efficient N-to-M Checkpointing Algorithm for Finite Element Simulations
David A. Ham, Vaclav Hapla, Matthew G. Knepley, Lawrence Mitchell, Koki Sagiyama
TL;DR
This work develops an efficient N-to-M checkpointing algorithm for finite element simulations, in which meshes, function spaces, and functions saved on N parallel processes can be loaded on M processes, where M need not equal N. Inter-process mappings are represented as star forests, and the workflow is implemented in PETSc and exposed in Firedrake through a new CheckpointFile API, so functions can be reconstructed on the loaded mesh however that mesh has been redistributed. The method is validated for correctness across multiple finite element families and dimensions, and it scales on ARCHER2 to billions of degrees of freedom (DoFs), with a detailed analysis of I/O performance. In practice, this enables flexible multi-session and post-processing workflows for large-scale finite element simulations without tying the loading process count to the saving process count. Future work aims to have loaded meshes inherit the global numbering, so that the exact loading distribution is preserved under repeated checkpointing.
Abstract
In this work, we introduce a new algorithm for N-to-M checkpointing in finite element simulations. The algorithm efficiently saves and loads functions representing physical quantities, together with the mesh representing the physical domain on which they are defined. Specifically, the number of parallel processes used for loading may differ from the number used for saving, so a simulation can be restarted or post-processed on a process count appropriate to the given phase of the simulation and to other conditions. For demonstration, we implemented this algorithm in PETSc, the Portable, Extensible Toolkit for Scientific Computation, and added a convenient high-level interface to Firedrake, a system for solving partial differential equations using finite element methods. We evaluated our implementation by saving and loading data involving 8.2 billion finite element degrees of freedom using 8,192 parallel processes on ARCHER2, the UK National Supercomputing Service.
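The core idea of N-to-M checkpointing can be illustrated with a toy, serial sketch in plain Python. It is not the PETSc or Firedrake implementation, and it does not use star forests. It assumes only that every degree of freedom carries a global number, that saving writes values in global order, and that loading reads them back through a different partition. The function names (partition, save, load) and the sample data are hypothetical and chosen for illustration.

```python
def partition(n_dofs, n_procs):
    """Split global DoF numbers 0..n_dofs-1 into contiguous per-process chunks."""
    base, extra = divmod(n_dofs, n_procs)
    chunks, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def save(local_values, local_to_global, n_dofs):
    """Write each process's values into one array ordered by global DoF number."""
    stored = [None] * n_dofs
    for values, l2g in zip(local_values, local_to_global):
        for value, g in zip(values, l2g):
            stored[g] = value
    return stored


def load(stored, local_to_global):
    """Read back values for each loading process through its own local-to-global map."""
    return [[stored[g] for g in l2g] for l2g in local_to_global]


n_dofs = 10

# Saving session: N = 3 simulated processes with interleaved (non-contiguous) ownership.
save_l2g = [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
save_vals = [[float(g) ** 2 for g in l2g] for l2g in save_l2g]
stored = save(save_vals, save_l2g, n_dofs)

# Loading session: M = 4 simulated processes with a different, contiguous partition.
load_l2g = partition(n_dofs, 4)
loaded = load(stored, load_l2g)
```

Because the stored array is indexed by global number only, the loading partition is free to differ from the saving partition. A real parallel implementation also has to handle mesh topology, shared entities, and parallel I/O, all of which this sketch leaves out.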
