Designing and Prototyping Extensions to MPI in MPICH
Hui Zhou, Ken Raffenetti, Yanfei Guo, Thomas Gillis, Robert Latham, Rajeev Thakur
TL;DR
This paper documents nonstandard MPI extensions implemented in MPICH (version 4.2.0) to address the needs of hybrid HPC programming, including interoperability with on-node runtimes and accelerators. Six extensions are presented with API prototypes, usage examples, and evaluation results: MPIX_Grequest (generalized requests with poll/wait progress), MPIX_Type_iov (iovec queries on datatypes), MPIX_Stream (mapping of execution contexts), enqueue offloading (execution on GPU streams), MPIX_Threadcomm (thread-level MPI), and MPIX_Stream_progress (custom progress control). Microbenchmarks and examples show performance gains (e.g., up to ~20% in multithreaded messaging) and clearer programming models for MPI+Threads and MPI+GPU scenarios. The extensions are backward compatible and serve as experiments to inform future MPI standardization, with MPICH acting as a testbed for gathering feedback from the ecosystem. The work was supported by Exascale Computing Project funding, and the extensions are intended for inclusion in future MPI specifications.
Abstract
As HPC system architectures and the applications running on them continue to evolve, the MPI standard itself must evolve. The trend in current and future HPC systems toward powerful nodes with multiple CPU cores and multiple GPU accelerators makes efficient support for hybrid programming critical for applications to achieve high performance. However, support for hybrid programming in the MPI standard has not kept up with these trends. The MPICH implementation of MPI provides a platform for implementing and experimenting with new proposals and extensions to fill this gap and to gain valuable experience and feedback before the MPI Forum considers them for standardization. In this work, we detail six extensions implemented in MPICH to increase MPI interoperability with other runtimes, with a specific focus on heterogeneous architectures. First, an extension to MPI generalized requests lets applications integrate asynchronous tasks into MPI's progress engine. Second, an iovec extension to datatypes lets applications use MPI datatypes as a general-purpose data layout API beyond MPI communication. Third, a new MPI object, the MPIX stream, lets applications identify execution contexts beyond MPI processes, including threads and GPU streams. MPIX stream communicators can be created to make existing MPI functions thread-aware and GPU-aware, thus giving applications explicit ways to achieve higher performance. Fourth, MPIX streams are extended to support enqueue semantics for offloading MPI communication onto a GPU stream context. Fifth, thread communicators allow MPI communicators to be constructed from individual threads, thus providing a new level of interoperability between MPI and on-node runtimes such as OpenMP. Finally, we present an extension for invoking MPI progress, which lets users spawn progress threads with fine-grained control.
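To make the second extension more concrete, the following is a minimal sketch of what it means to treat a data layout as a queryable list of contiguous segments. It does not use MPI or the MPICH interface; the `strided_layout` type and the `layout_iov_len` and `layout_iov` functions are hypothetical names introduced only for illustration, and the layout is assumed to be a simple strided pattern with a stride no smaller than the block length.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>   /* POSIX struct iovec */

/* Hypothetical layout descriptor (not an MPI type): `count` blocks of
 * `blocklen` bytes whose starting addresses are `stride` bytes apart. */
typedef struct {
    size_t count;
    size_t blocklen;
    size_t stride;
} strided_layout;

/* Number of contiguous segments needed to describe the layout. */
size_t layout_iov_len(const strided_layout *l)
{
    assert(l != NULL && l->stride >= l->blocklen);
    if (l->count == 0 || l->blocklen == 0)
        return 0;
    /* Blocks that touch each other merge into one segment. */
    return (l->stride == l->blocklen) ? 1 : l->count;
}

/* Write up to `max_iov` segments into `iov`, starting at segment index
 * `iov_offset`, so a long layout can be queried piece by piece.
 * Returns the number of segments actually written. */
size_t layout_iov(void *base, const strided_layout *l, size_t iov_offset,
                  struct iovec *iov, size_t max_iov)
{
    assert(base != NULL && iov != NULL);
    size_t total = layout_iov_len(l);

    if (total == 1) {
        /* Fully contiguous case: a single segment covers everything. */
        if (iov_offset > 0 || max_iov == 0)
            return 0;
        iov[0].iov_base = base;
        iov[0].iov_len = l->count * l->blocklen;
        return 1;
    }

    size_t n = 0;
    for (size_t i = iov_offset; i < total && n < max_iov; i++, n++) {
        iov[n].iov_base = (char *)base + i * l->stride;
        iov[n].iov_len = l->blocklen;
    }
    return n;
}
```

Under these assumptions, a caller first asks for the segment count, then retrieves segments in batches of whatever size its buffer allows. The resulting `struct iovec` array is in the form accepted by POSIX scatter/gather calls such as `writev`, which is one way a layout description can be useful outside of communication.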
