Frustrated with MPI+Threads? Try MPIxThreads!

Hui Zhou; Ken Raffenetti; Junchao Zhang; Yanfei Guo; Rajeev Thakur

TL;DR

This work addresses the inefficiencies of MPI+Threads by introducing MPIxThreads through the threadcomm extension in MPICH, which assigns distinct MPI ranks to threads inside a parallel region, creating an $N \times M$ communication space. It implements a thread-aware MPI API surface with blocking/nonblocking P2P and blocking collectives, underpinned by shared-memory messaging via a lockless MPSC queue and careful data layout with thread-local storage. Through case studies on point-to-point performance, collectives, and PETSc integration, the authors show threadcomm can outperform MPI-everywhere and integrate naturally with OpenMP, reducing code duplication and improving performance. Overall, the work offers a practical path to harmonize MPI and OpenMP, enabling dynamic expansion of MPI within a single node and easing access to MPI capabilities from within OpenMP regions.
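To make the "lockless MPSC queue" concrete, the following is a minimal sketch of one well-known multi-producer, single-consumer design (the intrusive node-based queue usually attributed to Dmitry Vyukov), written with C11 atomics. It is not taken from MPICH, and the paper's actual queue layout may differ. All names here (`mpsc_node`, `mpsc_queue`, `mpsc_push`, `mpsc_pop`) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical illustration types; not MPICH data structures. */
typedef struct mpsc_node {
    _Atomic(struct mpsc_node *) next;
    int payload;                 /* stand-in for a message descriptor */
} mpsc_node;

typedef struct {
    _Atomic(mpsc_node *) head;   /* most recently pushed node (producers) */
    mpsc_node *tail;             /* oldest node (touched by the consumer only) */
    mpsc_node stub;              /* sentinel so the list is never empty */
} mpsc_queue;

static inline void mpsc_init(mpsc_queue *q)
{
    atomic_store(&q->stub.next, NULL);
    atomic_store(&q->head, &q->stub);
    q->tail = &q->stub;
}

/* Any thread may push: one atomic exchange, no locks, no retry loop. */
static inline void mpsc_push(mpsc_queue *q, mpsc_node *n)
{
    assert(n != NULL);
    atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
    mpsc_node *prev = atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
    /* Link the previous head to the new node; publishes n to the consumer. */
    atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Only the single consumer may pop. Returns NULL when the queue is empty
 * or when a producer is between its exchange and its link store. */
static inline mpsc_node *mpsc_pop(mpsc_queue *q)
{
    mpsc_node *tail = q->tail;
    mpsc_node *next = atomic_load_explicit(&tail->next, memory_order_acquire);

    if (tail == &q->stub) {      /* skip over the sentinel */
        if (next == NULL)
            return NULL;
        q->tail = next;
        tail = next;
        next = atomic_load_explicit(&tail->next, memory_order_acquire);
    }
    if (next != NULL) {
        q->tail = next;
        return tail;
    }
    if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
        return NULL;             /* a push is in flight; try again later */

    /* tail is the last real node: re-insert the sentinel behind it. */
    mpsc_push(q, &q->stub);
    next = atomic_load_explicit(&tail->next, memory_order_acquire);
    if (next != NULL) {
        q->tail = next;
        return tail;
    }
    return NULL;
}
```

In a threadcomm-like setting, each receiving thread would plausibly own one such queue and every sending thread would act as a producer. That pairing is an assumption made for this sketch, not a statement about the MPICH code.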

Abstract

MPI+Threads, embodied by the MPI/OpenMP hybrid programming model, is a parallel programming paradigm where threads are used for on-node shared-memory parallelization and MPI is used for multi-node distributed-memory parallelization. OpenMP provides an incremental approach to parallelize code, while MPI, with its isolated address space and explicit messaging API, affords straightforward paths to obtain good parallel performance. However, MPI+Threads is not an ideal solution. Since MPI is unaware of the thread context, it cannot be used for interthread communication. This results in duplicated efforts to create separate and sometimes nested solutions for similar parallel tasks. In addition, because the MPI library is required to obey message-ordering semantics, mixing threads and MPI via MPI_THREAD_MULTIPLE can easily result in miserable performance due to accidental serializations. We propose a new MPI extension, MPIX Thread Communicator (threadcomm), that allows threads to be assigned distinct MPI ranks within thread parallel regions. The threadcomm extension combines both MPI processes and OpenMP threads to form a unified parallel environment. We show that this MPIxThreads (MPI Multiply Threads) paradigm allows OpenMP and MPI to work together in a complementary way to achieve both cleaner codes and better performance.
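As a rough illustration of what "distinct MPI ranks for threads" implies, the sketch below computes a rank layout for a threadcomm of size $N \times M$. It assumes a process-major ordering (all threads of parent rank 0 first, then parent rank 1, and so on). That ordering, and the helper names `tc_rank`, `tc_locate`, `tc_size`, and `tc_same_process`, are assumptions made for this example and are not part of the MPICH API.

```c
#include <assert.h>

/* Hypothetical helper type: where a threadcomm rank lives. */
typedef struct {
    int parent_rank;  /* rank of the owning process in the parent communicator */
    int thread_id;    /* thread index inside the parallel region, 0..M-1 */
} tc_location;

/* Total number of ranks: N processes times M threads per process. */
static inline int tc_size(int num_procs, int threads_per_proc)
{
    assert(num_procs > 0 && threads_per_proc > 0);
    return num_procs * threads_per_proc;
}

/* Assumed process-major layout: rank = parent_rank * M + thread_id. */
static inline int tc_rank(int parent_rank, int thread_id, int threads_per_proc)
{
    assert(parent_rank >= 0 && threads_per_proc > 0);
    assert(thread_id >= 0 && thread_id < threads_per_proc);
    return parent_rank * threads_per_proc + thread_id;
}

/* Inverse mapping: recover (parent_rank, thread_id) from a threadcomm rank. */
static inline tc_location tc_locate(int rank, int threads_per_proc)
{
    assert(rank >= 0 && threads_per_proc > 0);
    tc_location loc = { rank / threads_per_proc, rank % threads_per_proc };
    return loc;
}

/* Two ranks share an address space exactly when they have the same parent
 * rank; such pairs can exchange messages through shared memory. */
static inline int tc_same_process(int rank_a, int rank_b, int threads_per_proc)
{
    return tc_locate(rank_a, threads_per_proc).parent_rank ==
           tc_locate(rank_b, threads_per_proc).parent_rank;
}
```

For example, with $N = 4$ processes and $M = 8$ threads each, this layout yields 32 ranks, and thread 3 of parent rank 2 becomes rank 19.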

Paper Structure (14 sections, 6 figures)

Introduction
Proposal: MPIX threadcomm
Implementation
Shared memory and thread-local storage
Shared-memory messaging
Case studies
Case study: point-to-point latency and bandwidth
Case study: collectives
Case study: using PETSc
Related work
Special MPI implementations based on threads
Comparison with the MPI endpoints proposal
Discussion and Perspective
Summary

Figures (6)

Figure 1: Diagram showing a threadcomm created in a multithreaded parallel region with a size $N \times M$, where $N$ is the number of ranks in the parent communicator and $M$ is the number of threads in each process inside the parallel region.
Figure 2: Diagram showing the setup for the point-to-point case study. (a) Launch multiple MPI processes on a single node. (b) Launch a single process, use OpenMP to create a parallel region, and use threadcomm for MPI point-to-point messaging.
Figure 3: Point-to-point message latency and bandwidth comparison between MPI-everywhere and OpenMP+threadcomm on an Intel Xeon Gold 5317. Processes or threads are bound to cores on the same socket.
Figure 4: Latency comparison between the OpenMP barrier and MPI_Barrier via the thread communicator on an Intel Xeon Gold 5317. Threads are bound to cores.
Figure 5: Latency comparison between OpenMP reduction and MPI_Reduce via threadcomm on an Intel Xeon Gold 5317, using 16 threads bound to cores.
...and 1 more figure
