PSI/J: A Portable Interface for Submitting, Monitoring, and Managing Jobs
Mihael Hategan-Marandiuc, Andre Merzky, Nicholson Collier, Ketan Maheshwari, Jonathan Ozik, Matteo Turilli, Andreas Wilke, Justin M. Wozniak, Kyle Chard, Ian Foster, Rafael Ferreira da Silva, Shantenu Jha, Daniel Laney
TL;DR
PSI/J addresses the challenge of making HPC applications portable across diverse local resource managers (LRMs) by introducing a minimal, language-agnostic JAAPI with a three-layer API structure (local, remote, nested) and a dynamic plugin system of executors and launchers. The approach emphasizes API simplicity, asynchronous operation for scalability, and reuse of common implementations via text-based LRM templates, enabling lightweight, user-space deployment and cross-system portability. The paper provides a comprehensive design rationale and a reference Python implementation, demonstrates practical integration with Parsl, RADICAL-Pilot, Swift/T, and OSPREY, and reports a lower-bound overhead assessment indicating that PSI/J adds minimal latency (on the order of milliseconds per job). The work argues for community-driven evolution and testing infrastructure to sustain JAAPIs in the face of intrinsic and extrinsic adoption barriers, highlighting the potential for durable, scalable HPC workflow portability.
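The combination of a plugin registry of executors and asynchronous, callback-driven job submission described above can be illustrated with a minimal sketch. This is not PSI/J's actual API: the names `Job`, `JobState`, `Executor`, `register_executor`, `get_executor`, and `run_demo` are hypothetical and chosen for illustration only, and the single "local" executor simply runs a subprocess on a background thread instead of talking to a real scheduler.

```python
import subprocess
import sys
import threading
from enum import Enum
from typing import Callable, Dict, List, Optional, Type


class JobState(Enum):
    NEW = "new"
    QUEUED = "queued"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"


class Job:
    """Hypothetical job object: a command line plus its observable state."""

    def __init__(self, executable: str, arguments: Optional[List[str]] = None):
        self.executable = executable
        self.arguments = arguments or []
        self.state = JobState.NEW
        self.exit_code: Optional[int] = None
        self._done = threading.Event()
        self._callbacks: List[Callable[["Job", JobState], None]] = []

    def on_status(self, callback: Callable[["Job", JobState], None]) -> None:
        """Register a callback invoked on every state change."""
        self._callbacks.append(callback)

    def _set_state(self, state: JobState, exit_code: Optional[int] = None) -> None:
        self.state = state
        if exit_code is not None:
            self.exit_code = exit_code
        for callback in self._callbacks:
            callback(self, state)
        # Wake up waiters only after callbacks have seen the final state.
        if state in (JobState.COMPLETED, JobState.FAILED):
            self._done.set()

    def wait(self, timeout: Optional[float] = None) -> bool:
        """Block until the job reaches a final state; False on timeout."""
        return self._done.wait(timeout)


class Executor:
    """Base class for backends; a real one would wrap a specific scheduler."""

    def submit(self, job: Job) -> None:
        raise NotImplementedError


# Plugin registry: backends are looked up by name at run time.
_EXECUTORS: Dict[str, Type[Executor]] = {}


def register_executor(name: str):
    def decorator(cls: Type[Executor]) -> Type[Executor]:
        _EXECUTORS[name] = cls
        return cls
    return decorator


def get_executor(name: str) -> Executor:
    return _EXECUTORS[name]()


@register_executor("local")
class LocalExecutor(Executor):
    """Runs the job as a local subprocess; submit() returns immediately."""

    def submit(self, job: Job) -> None:
        job._set_state(JobState.QUEUED)
        threading.Thread(target=self._run, args=(job,), daemon=True).start()

    def _run(self, job: Job) -> None:
        job._set_state(JobState.ACTIVE)
        try:
            result = subprocess.run([job.executable, *job.arguments])
        except OSError:
            job._set_state(JobState.FAILED)
            return
        final = JobState.COMPLETED if result.returncode == 0 else JobState.FAILED
        job._set_state(final, result.returncode)


def run_demo():
    """Submit one trivial job and collect the states it passes through."""
    seen: List[JobState] = []
    job = Job(sys.executable, ["-c", "pass"])
    job.on_status(lambda j, state: seen.append(state))
    get_executor("local").submit(job)  # non-blocking
    job.wait(timeout=30)
    return job, seen
```

In this sketch, application code depends only on the executor name passed to `get_executor`, so supporting another scheduler would mean registering another `Executor` subclass without changing the calling code. That is the portability argument in miniature, under the stated assumptions.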
Abstract
It is generally desirable for high-performance computing (HPC) applications to be portable between HPC systems, for example to make use of more performant hardware, to make effective use of allocations, and to co-locate compute jobs with large datasets. Unfortunately, moving scientific applications between HPC systems is challenging for various reasons, most notably that HPC systems have different HPC schedulers. We introduce PSI/J, a job management abstraction API intended to simplify the construction of software components and applications that are portable over various HPC scheduler implementations. We argue that such a system is necessary and that no viable alternative currently exists. We analyze similar notable APIs and attempt to determine the factors that influenced their evolution and adoption by the HPC community. We base the design of PSI/J on that analysis. We describe how PSI/J has been integrated into three workflow systems and one application, and also show via experiments that PSI/J imposes minimal overhead.
