Exascale Workflow Applications and Middleware: An ExaWorks Retrospective

Aymen Alsaadi, Mihael Hategan-Marandiuc, Ketan Maheshwari, Andre Merzky, Mikhail Titov, Matteo Turilli, Andreas Wilke, Justin M. Wozniak, Kyle Chard, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney

TL;DR

The ExaWorks project developed a workflow Software Development Toolkit (SDK), a curated collection of workflow technologies that can be composed and interoperated through a common interface, engineered following current best practices, and specifically designed to work on HPC platforms.

Abstract

Exascale computers offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. However, these software combinations and integrations are difficult to achieve because heterogeneous software components must be coordinated and deployed on diverse and massive platforms. We present the ExaWorks project, which addresses many of these challenges. We developed a workflow Software Development Toolkit (SDK), a curated collection of workflow technologies that can be composed and interoperated through a common interface, engineered following current best practices, and specifically designed to work on HPC platforms. ExaWorks also developed PSI/J, a job management abstraction API, to simplify the construction of portable software components and applications that can be used over various HPC schedulers. The PSI/J API is a minimal interface for submitting jobs and monitoring their execution state across multiple commonly used HPC schedulers. We also describe several leading and innovative workflow examples of ExaWorks tools used on DOE leadership platforms. Furthermore, we discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of workflows sustainably at the exascale.
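The idea of a minimal, scheduler-agnostic job interface described in the abstract can be illustrated with a small sketch. This is not the PSI/J API: every name here (`JobSpec`, `Job`, `JobState`, `Executor`, `LocalExecutor`, `get_executor`, `run`) is a hypothetical stand-in chosen for illustration, and the only backend is a local one built on Python's `subprocess` module. A backend for a real scheduler would, under the same assumption, implement the same two methods.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class JobState(Enum):
    """Hypothetical execution states for a job."""
    NEW = "new"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class JobSpec:
    """What to run; deliberately independent of any scheduler."""
    executable: str
    arguments: List[str] = field(default_factory=list)


@dataclass
class Job:
    """A job handle that carries its spec and its current state."""
    spec: JobSpec
    state: JobState = JobState.NEW
    exit_code: Optional[int] = None
    _proc: Optional[subprocess.Popen] = field(default=None, repr=False)


class Executor(ABC):
    """The whole interface: submit a job, then wait for its final state."""

    @abstractmethod
    def submit(self, job: Job) -> None: ...

    @abstractmethod
    def wait(self, job: Job) -> JobState: ...


class LocalExecutor(Executor):
    """Illustrative backend that runs jobs as local child processes."""

    def submit(self, job: Job) -> None:
        job._proc = subprocess.Popen([job.spec.executable, *job.spec.arguments])
        job.state = JobState.ACTIVE

    def wait(self, job: Job) -> JobState:
        if job._proc is None:
            raise RuntimeError("job was never submitted")
        job.exit_code = job._proc.wait()
        job.state = JobState.COMPLETED if job.exit_code == 0 else JobState.FAILED
        return job.state


def get_executor(name: str) -> Executor:
    """Select a backend by name; application code never names a scheduler type."""
    backends = {"local": LocalExecutor}
    return backends[name]()


def run(backend: str, executable: str, arguments: List[str]) -> Job:
    """Submit one job on the named backend and block until it finishes."""
    executor = get_executor(backend)
    job = Job(JobSpec(executable, arguments))
    executor.submit(job)
    executor.wait(job)
    return job


if __name__ == "__main__":
    finished = run("local", sys.executable, ["-c", "print('hello from a job')"])
    print(finished.state, finished.exit_code)
```

In this sketch, portability comes from the fact that `run` depends only on the backend name and the abstract `Executor` methods, so switching schedulers would mean changing one string rather than the application code.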

Paper Structure

This paper contains 19 sections, 3 figures.

Figures (3)

  • Figure 1: The ExaWorks SDK provides various interfaces, programming models, and runtime capabilities for executing scientific workflows at scale on high-performance computing (HPC) platforms. The reference stack (represented by the blue boxes) includes components that enable end-to-end capabilities, from workflow description to execution management (indicated by the vertical extent). The purple boxes highlight the integrations of multiple components, achieved by having each component expose its APIs for workflow and resource management, among other functions. For instance, the left-most purple stack illustrates the integration of PSI/J, which replaces Parsl's built-in scheduler support, alongside the use of Flux for job scheduling. The dashed line signifies the boundary between the workflow systems (operating in user space) and the specific capabilities of HPC platforms (such as Slurm). Both Flux and PSI/J operate in user and system space, with PSI/J bridging these two areas. Importantly, both systems remain independent of any system-specific plug-ins.
  • Figure 2: Illustration of the local layer of PSI/J.
  • Figure 3: Intended usage scenario for the remote layer of PSI/J.