MPI Progress For All
Hui Zhou, Robert Latham, Ken Raffenetti, Yanfei Guo, Rajeev Thakur
TL;DR
The paper addresses the ambiguous semantics of MPI progress and its impact on computation/communication overlap in HPC. It analyzes MPI messaging patterns and general asynchronous task behavior to advocate interoperable MPI progress and a set of extensions (including MPIX Streams, MPIX_Async, MPIX_Stream_progress, and MPIX_Request_is_complete) that decouple progress from user-level task contexts. It then presents a programming scheme and concrete examples (dummy tasks, per-stream progress, event-like callbacks, and user-level collectives) to demonstrate how explicit progress can be integrated with task-based and event-driven runtimes, often achieving latency and throughput competitive with or superior to traditional MPI approaches. The work argues that exposing interoperable progress enables rapid prototyping of MPI algorithms and user-level collectives while reducing complexity and contention, thereby bridging MPI with modern asynchronous programming practices and enabling scalable, architecture-aware progress management.
Abstract
The progression of communication in the Message Passing Interface (MPI) is not well defined, yet it is critical for application performance, particularly in achieving effective computation and communication overlap. The opaque nature of MPI progress poses significant challenges in advancing MPI within modern high-performance computing (HPC) practices. First, the lack of clarity hinders the development of explicit guidelines for enhancing computation and communication overlap in applications. Second, it prevents MPI from seamlessly integrating with contemporary programming paradigms, such as task-based runtimes and event-driven programming. Third, it limits the extension of MPI functionality from user space. In this paper, we examine the role of MPI progress by analyzing the implementation details of MPI messaging. We then generalize the asynchronous communication pattern and identify key factors influencing application performance. Based on this analysis, we propose a set of MPI extensions designed to enable users to explicitly construct and manage an efficient progress engine. We provide example code to demonstrate the use of these proposed APIs in achieving improved performance, adapting MPI to task-based or event-driven programming styles, and constructing collective algorithms that rival the performance of native implementations. We compare our approach with previous efforts in the field, highlighting its reduced complexity and increased effectiveness.
