ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies

Matteo Turilli, Mihael Hategan-Marandiuc, Mikhail Titov, Ketan Maheshwari, Aymen Alsaadi, Andre Merzky, Ramon Arambula, Mikhail Zakharchanka, Matt Cowan, Justin M. Wozniak, Andreas Wilke, Ozgur Ozan Kilic, Kyle Chard, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney

TL;DR

ExaWorks addresses the fragmentation of HPC workflow technologies by delivering a curated SDK that interoperates across diverse workflow engines and middleware. The approach combines end-to-end capabilities with modular connectors, enabling integration with minimal code changes, and is reinforced by continuous integration, a public test dashboard, and dynamic, tutorial-based documentation. Through exemplar success stories, ranging from EnTK-based ExaAM UQ pipelines to Swift/T-driven CANDLE workloads and Colmena steering experiments, the work demonstrates scalable, real-world applicability on exascale platforms. The paper argues that sustained DOE support, together with an interoperable, well-tested toolkit, is essential to enable scalable, portable scientific workflows across diverse DOE facilities and future exascale architectures.

Abstract

Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing (HPC) platforms. Heterogeneous workflows contain different types of tasks (e.g., simulation, analysis, and learning) that need to be mapped, scheduled, and launched on different computing resources. This requires a software stack that enables users to code their workflows and automate resource management and workflow execution. Currently, there are many workflow technologies with diverse levels of robustness and capabilities, and users face difficult choices of software that can effectively and efficiently support their use cases on HPC machines, especially when considering the latest exascale platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a curated collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms. We present our experience with (1) curating those technologies, (2) integrating them to provide users with new capabilities, (3) developing a continuous integration platform to test the SDK on DOE HPC platforms, (4) designing a dashboard to publish the results of those tests, and (5) devising an innovative documentation platform to help users use those technologies. Our experience details the requirements and the best practices needed to curate workflow technologies, and it also serves as a blueprint for the capabilities and services that DOE will have to offer to support a variety of heterogeneous scientific workflows on the newly available exascale HPC platforms.

Paper Structure

This paper contains 16 sections and 7 figures.

Figures (7)

  • Figure 1: ExaWorks SDK reference stack (right), current components (blue boxes), and examples of integration among components (purple boxes). The SDK offers a variety of interfaces, programming models, and runtime capabilities to execute scientific workflows at scale on HPC platforms.
  • Figure 2: RPEX Architecture. Integration between Parsl (blue boxes) and RADICAL-Pilot (purple and green boxes) via a Task Translator function.
  • Figure 3: Integration of Flux into RADICAL-Pilot.
  • Figure 4: RADICAL Cybertools were used to implement a scalable UQ workflow with the ExaAM team. These plots show that overall utilization for the Frontier challenge run was 448,000 CPU cores and 64,000 GPUs, not including the 8 CPU cores per node reserved for system processes.
  • Figure 5: Progress of a typical "Challenge Problem: Leave One Out" campaign implemented for the CANDLE team with the Swift/T ExaWorks SDK component. Several restarts are performed at various scales over a month. During execution, workflow tasks are rapidly completed. This run was used to look for problems in the training data, and only 2 epochs were run per Uno task (see the text).
  • ...and 2 more figures