Software engineering to sustain a high-performance computing scientific application: QMCPACK

William F. Godoy, Steven E. Hahn, Michael M. Walsh, Philip W. Fackler, Jaron T. Krogel, Peter W. Doak, Paul R. C. Kent, Alfredo A. Correa, Ye Luo, Mark Dewing

TL;DR

The paper addresses the sustainability challenges of a mature HPC scientific code, QMCPACK, by applying research software engineering practices. It outlines a cohesive program—Docker-based reproducibility, GitHub Actions CI for CPU/GPU, memory-safety sanitization, and targeted refactoring (checkpoint/restart, legacy GPU removal, test coverage, and input validation)—to shift from reactive maintenance to predictive quality assurance. Empirical metrics show increased code coverage (38% to 52%), significant code-base reductions (nearly 40K lines removed), and deeper validation across CPU and GPU configurations, illustrating tangible gains in reliability and maintainability. The work demonstrates how structured RSE practices can enable exascale readiness and more rapid scientific discovery on diverse HPC platforms.

Abstract

We provide an overview of the software engineering efforts and their impact in QMCPACK, a production-level ab-initio Quantum Monte Carlo open-source code targeting high-performance computing (HPC) systems. Aspects included are: (i) strategic expansion of continuous integration (CI) targeting CPUs, using GitHub Actions runners, and NVIDIA and AMD GPUs in pre-exascale systems, using self-hosted hardware; (ii) incremental reduction of memory leaks using sanitizers; (iii) incorporation of Docker containers for CI and reproducibility; and (iv) refactoring efforts to improve maintainability, testing coverage, and memory lifetime management. We quantify the value of these improvements by providing metrics to illustrate the shift towards a predictive, rather than reactive, sustainable maintenance approach. Our goal, in documenting the impact of these efforts on QMCPACK, is to contribute to the body of knowledge on the importance of research software engineering (RSE) for the sustainability of community HPC codes and scientific discovery at scale.

Paper Structure (15 sections, 3 figures, 4 tables)