Designing and Prototyping Extensions to MPI in MPICH

Hui Zhou; Ken Raffenetti; Yanfei Guo; Thomas Gillis; Robert Latham; Rajeev Thakur

TL;DR

This paper documents nonstandard MPI extensions implemented in MPICH (version 4.2.0) to address hybrid HPC needs, including interoperability with on-node runtimes and accelerators. Six extensions are presented, each with API prototypes, usage examples, and evaluation results: MPIX_Grequest (generalized requests with poll/wait progress), MPIX_Type_iov (datatype iovec querying), MPIX_Stream (execution-context mapping), offloading enqueue (GPU-stream execution), MPIX_Threadcomm (thread-level MPI), and MPIX_Stream_progress (custom progress control). Microbenchmarks show measurable performance gains (e.g., a multithreaded message rate around 20% higher with MPIX_Stream than with implicit VCIs for 8-byte messages), and the examples show clearer programming models for MPI+Threads and MPI+GPU scenarios. The work emphasizes backward-compatible experimentation and positions MPICH as a testbed for gathering feedback before standardization. It was supported by Exascale Computing Project funding, and the extensions are intended as candidates for inclusion in future MPI specifications.

Abstract

As HPC system architectures and the applications running on them continue to evolve, the MPI standard itself must evolve. The trend in current and future HPC systems toward powerful nodes with multiple CPU cores and multiple GPU accelerators makes efficient support for hybrid programming critical for applications to achieve high performance. However, the support for hybrid programming in the MPI standard has not kept up with recent trends. The MPICH implementation of MPI provides a platform for implementing and experimenting with new proposals and extensions to fill this gap and to gain valuable experience and feedback before the MPI Forum can consider them for standardization. In this work, we detail six extensions implemented in MPICH to increase MPI interoperability with other runtimes, with a specific focus on heterogeneous architectures. First, the extension to MPI generalized requests lets applications integrate asynchronous tasks into MPI's progress engine. Second, the iovec extension to datatypes lets applications use MPI datatypes as a general-purpose data layout API beyond just MPI communications. Third, a new MPI object, MPIX stream, can be used by applications to identify execution contexts beyond MPI processes, including threads and GPU streams. MPIX stream communicators can be created to make existing MPI functions thread-aware and GPU-aware, thus providing applications with explicit ways to achieve higher performance. Fourth, MPIX Streams are extended to support the enqueue semantics for offloading MPI communications onto a GPU stream context. Fifth, thread communicators allow MPI communicators to be constructed with individual threads, thus providing a new level of interoperability between MPI and on-node runtimes such as OpenMP. Lastly, we present an extension to invoke MPI progress, which lets users spawn progress threads with fine-grained control.
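To make the "datatype as a general-purpose data layout API" idea concrete, the following is a minimal conceptual sketch in plain Python. It does not call MPICH or reproduce the MPIX_Type_iov interface; the helper names `vector_iov` and `gather` are hypothetical and exist only for illustration. It assumes a strided layout with the same parameters as the standard `MPI_Type_vector` (count, blocklength, stride) and flattens it into a list of (byte offset, byte length) segments, which is the kind of information an iovec query would expose.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (byte offset, byte length)


def vector_iov(count: int, blocklength: int, stride: int,
               elem_size: int) -> List[Segment]:
    """Flatten a vector-style layout into contiguous byte segments.

    Hypothetical helper (not an MPICH function). The layout follows
    MPI_Type_vector: `count` blocks of `blocklength` elements each,
    with block starts `stride` elements apart.
    """
    segments: List[Segment] = []
    length = blocklength * elem_size
    for i in range(count):
        if length == 0:
            continue
        offset = i * stride * elem_size
        # Merge with the previous segment when the two are adjacent in memory.
        if segments and segments[-1][0] + segments[-1][1] == offset:
            prev_off, prev_len = segments[-1]
            segments[-1] = (prev_off, prev_len + length)
        else:
            segments.append((offset, length))
    return segments


def gather(buf: bytes, segments: List[Segment]) -> bytes:
    """Pack the bytes described by `segments` into one contiguous buffer."""
    return b"".join(buf[off:off + ln] for off, ln in segments)


# 3 blocks of 2 four-byte elements, block starts 4 elements apart.
iov = vector_iov(count=3, blocklength=2, stride=4, elem_size=4)
print(iov)  # [(0, 8), (16, 8), (32, 8)]
```

Once a layout is available as a segment list, code outside the MPI library (for example, a packing routine or an I/O layer) can walk the same description that the datatype encodes, which is the interoperability benefit the abstract points to.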

Paper Structure (27 sections, 8 figures)

Introduction
Generalized Requests
  Background
  Extension
  Example
Derived Datatypes
  Extension
  Example
MPIX Streams
  Background
  Extension
  Example
  Evaluation
Offloading Asynchronous Operations
  Background
...and 12 more sections

Figures (8)

Figure 1: Diagrams illustrating asynchronous operations via MPI generalized request. (a) The current standard API requires background threads to complete the request. (b) Extension may eliminate the need for a background thread.
Figure 2: Some MPI derived datatypes with illustrations of their creation routines.
Figure 3: Diagram illustrating mapping communications to network endpoints. (a) Implicit scheme maps communications implicitly to internal virtual communication interfaces, requires locking, and may result in mismapping. (b) Explicit scheme requires explicit context from communicators and may eliminate locking.
Figure 4: Multithreaded message rate for 8-byte messages using MPI_Isend/MPI_Irecv. The message rate using MPIX_Stream is around 20% higher than with implicit VCIs.
Figure 5: Diagram illustrating how MPI operations are launched into the accelerator context and triggered to run under the host context.
...and 3 more figures
