MPI Progress For All

Hui Zhou, Robert Latham, Ken Raffenetti, Yanfei Guo, Rajeev Thakur

TL;DR

The paper addresses the ambiguous semantics of MPI progress and its impact on computation/communication overlap in HPC. It analyzes MPI messaging patterns and general asynchronous task behavior to advocate interoperable MPI progress and a set of extensions (including MPIX Streams, MPIX_Async, MPIX_Stream_progress, and MPIX_Request_is_complete) that decouple progress from user-level task contexts. It then presents a programming scheme and concrete examples (dummy tasks, per-stream progress, event-like callbacks, and user-level collectives) to demonstrate how explicit progress can be integrated with task-based and event-driven runtimes, often achieving competitive or superior latency/throughput compared to traditional MPI approaches. The work argues that exposing interoperable progress enables rapid prototyping of MPI algorithms and user-level collectives while reducing complexity and contention, thereby bridging MPI with modern asynchronous programming practices and enabling scalable, architecture-aware progress management.

Abstract

The progression of communication in the Message Passing Interface (MPI) is not well defined, yet it is critical for application performance, particularly in achieving effective computation and communication overlap. The opaque nature of MPI progress poses significant challenges in advancing MPI within modern high-performance computing (HPC) practices. Firstly, the lack of clarity hinders the development of explicit guidelines for enhancing computation and communication overlap in applications. Secondly, it prevents MPI from seamlessly integrating with contemporary programming paradigms, such as task-based runtimes and event-driven programming. Thirdly, it limits the extension of MPI functionalities from the user space. In this paper, we examine the role of MPI progress by analyzing the implementation details of MPI messaging. We then generalize the asynchronous communication pattern and identify key factors influencing application performance. Based on this analysis, we propose a set of MPI extensions designed to enable users to explicitly construct and manage an efficient progress engine. We provide example codes to demonstrate the use of these proposed APIs in achieving improved performance, adapting MPI to task-based or event-driven programming styles, and constructing collective algorithms that rival the performance of native implementations. Our approach is compared to previous efforts in the field, highlighting its reduced complexity and increased effectiveness.

Paper Structure (33 sections, 13 figures)

Figures (13)

  • Figure 1: Common communication modes: (a) Buffered eager send; (b) Normal eager send; (c) Rendezvous send; (d) Receiving an eager message that arrived before posting the receive; (e) Receiving an eager message that arrived after posting the receive; (f) Receiving a rendezvous message.
  • Figure 2: Task patterns: (a) A task with no blocking parts; (b) A task with a single blocking part; (c) A task with multiple blocking parts.
  • Figure 3: Nonblocking tasks: (a) A task with no blocking parts; (b) A task with a single blocking part; (c) A task with multiple blocking parts.
  • Figure 4: Computation/communication overlap: (a) Communication with no blocking parts; (b) Communication with a single blocking part; (c) Communication with multiple blocking parts.
  • Figure 5: Remedies for the lack of progress: (a) Intersperse progress tests inside computations; (b) Use a dedicated thread to continuously poll progress.
  • ...and 8 more figures
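
Remedy (a) in Figure 5, interspersing progress tests inside a computation, can be illustrated with a small toy model. The sketch below is not MPI code and is not taken from the paper: `PendingOp`, `compute_with_polling`, and `wait_for` are hypothetical names, and the assumption that an operation completes after a fixed number of progress calls is a deliberate simplification. In a real MPI program the poll would be a call such as `MPI_Test` on an outstanding request.

```python
class PendingOp:
    """Toy stand-in for a nonblocking operation (hypothetical, not an MPI type).

    Assumption: the operation finishes after `steps_needed` progress calls.
    """

    def __init__(self, steps_needed):
        self.steps_left = steps_needed

    def progress(self):
        """Advance the operation by one step; return True once it is complete."""
        if self.steps_left > 0:
            self.steps_left -= 1
        return self.steps_left == 0


def compute_with_polling(op, work_items, poll_every):
    """Run `work_items` units of work, polling `op` every `poll_every` units.

    Returns (result, completed_at). `completed_at` is the number of work units
    finished when the operation completed, or None if it is still pending.
    """
    total = 0
    completed_at = None
    for i in range(1, work_items + 1):
        total += i * i  # stand-in for real computation
        if completed_at is None and i % poll_every == 0 and op.progress():
            completed_at = i
    return total, completed_at


def wait_for(op):
    """Blocking wait: poll until complete; return how many polls were needed."""
    polls = 0
    while op.steps_left > 0:
        op.progress()
        polls += 1
    return polls


if __name__ == "__main__":
    # With interspersed polls, the operation finishes during the computation.
    overlapped = PendingOp(steps_needed=3)
    _, done_at = compute_with_polling(overlapped, work_items=100, poll_every=10)
    print("completed during compute at unit:", done_at)
    print("polls left for the final wait:", wait_for(overlapped))

    # Without polls, all progress is deferred to the final wait.
    deferred = PendingOp(steps_needed=3)
    _, done_at = compute_with_polling(deferred, work_items=100, poll_every=1000)
    print("completed during compute at unit:", done_at)
    print("polls left for the final wait:", wait_for(deferred))
```

In this model the polling interval sets the trade-off: polling more often completes the operation earlier but interrupts the computation more frequently, while never polling leaves all of the communication work to the final wait, so nothing overlaps.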