Frustrated with MPI+Threads? Try MPIxThreads!
Hui Zhou, Ken Raffenetti, Junchao Zhang, Yanfei Guo, Rajeev Thakur
TL;DR
This work addresses the inefficiencies of MPI+Threads by introducing MPIxThreads through the threadcomm extension in MPICH, which assigns distinct MPI ranks to threads inside a parallel region, so that $N$ processes with $M$ threads each form an $N \times M$ communication space. It implements a thread-aware MPI API surface with blocking and nonblocking point-to-point operations and blocking collectives, underpinned by shared-memory messaging via a lockless multi-producer, single-consumer (MPSC) queue and careful data layout with thread-local storage. Through case studies on point-to-point performance, collectives, and PETSc integration, the authors show that threadcomm can outperform MPI-everywhere and integrate naturally with OpenMP, reducing code duplication and improving performance. Overall, the work offers a practical path to harmonize MPI and OpenMP, enabling dynamic expansion of MPI within a single node and easing access to MPI capabilities from within OpenMP regions.
Abstract
MPI+Threads, embodied by the MPI/OpenMP hybrid programming model, is a parallel programming paradigm where threads are used for on-node shared-memory parallelization and MPI is used for multi-node distributed-memory parallelization. OpenMP provides an incremental approach to parallelizing code, while MPI, with its isolated address space and explicit messaging API, affords straightforward paths to good parallel performance. However, MPI+Threads is not an ideal solution. Since MPI is unaware of the thread context, it cannot be used for interthread communication. This results in duplicated effort to create separate and sometimes nested solutions for similar parallel tasks. In addition, because the MPI library is required to obey message-ordering semantics, mixing threads and MPI via MPI_THREAD_MULTIPLE can easily result in miserable performance due to accidental serialization. We propose a new MPI extension, MPIX Thread Communicator (threadcomm), that allows threads to be assigned distinct MPI ranks within thread parallel regions. The threadcomm extension combines both MPI processes and OpenMP threads to form a unified parallel environment. We show that this MPIxThreads (MPI Multiply Threads) paradigm allows OpenMP and MPI to work together in a complementary way to achieve both cleaner code and better performance.
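To make the idea of a unified rank space concrete, the following is a minimal sketch of how distinct ranks could be laid out when N processes each contribute M threads. It is an illustration only: the helper names (`tc_size`, `tc_rank`, `tc_coord_of`, `tc_same_process`) are hypothetical and are not part of the threadcomm API, and the process-major layout with a uniform thread count per process is an assumption made for simplicity.

```c
/* Illustrative sketch only: these helpers are hypothetical and are NOT
 * part of MPICH or the threadcomm extension. Assumptions: every process
 * contributes the same number of threads (m_threads), and ranks are laid
 * out process by process (process-major order). */

typedef struct {
    int process; /* index of the owning MPI process, 0 .. N-1 */
    int thread;  /* index of the thread within that process, 0 .. M-1 */
} tc_coord;

/* Total number of ranks in the combined N x M space. */
int tc_size(int n_procs, int m_threads)
{
    return n_procs * m_threads;
}

/* Flat rank of thread `thread` on process `process`. */
int tc_rank(int process, int thread, int m_threads)
{
    return process * m_threads + thread;
}

/* Inverse mapping: recover (process, thread) from a flat rank. */
tc_coord tc_coord_of(int rank, int m_threads)
{
    tc_coord c;
    c.process = rank / m_threads;
    c.thread = rank % m_threads;
    return c;
}

/* Nonzero when two flat ranks live in the same process, i.e. when a
 * message between them is interthread rather than interprocess. */
int tc_same_process(int rank_a, int rank_b, int m_threads)
{
    return rank_a / m_threads == rank_b / m_threads;
}
```

Under these assumptions, with 2 processes and 4 threads each, thread 2 of process 1 would hold rank 6 in a space of 8 ranks, and ranks 4 through 7 would all belong to the same process.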
