Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation

Patrick G. Bridges; Derek Schafer; Jack Lange; James B. White; Anthony Skjellum; Evan Suggs; Thomas Hines; Purushotham Bangalore; Matthew G. F. Dosanjh; Whit Schonbein

TL;DR

This paper describes the design, implementation, and evaluation of an easy-to-use, high-performance MPI-based GPU communication API that enables CPU-free communication. It demonstrates the API's utility and performance by showing how it naturally enables CPU-free gather/scatter halo exchange communication primitives in the Cabana/Kokkos performance portability framework.

Abstract

Removing the CPU from the communication fast path is essential to the performance of GPU-based ML and HPC applications. However, existing GPU communication APIs either continue to rely on the CPU for communication or place significant synchronization burdens on programmers. In this paper, we describe the design, implementation, and evaluation of an MPI-based GPU communication API enabling easy-to-use, high-performance, CPU-free communication. This API builds on previously proposed MPI extensions and leverages the capabilities of the HPE Slingshot 11 network card. We demonstrate the utility and performance of the API in two ways: by showing how it naturally enables CPU-free gather/scatter halo exchange communication primitives in the Cabana/Kokkos performance portability framework, and through a performance comparison with Cray MPICH on the Frontier and Tuolumne supercomputers. Results from this evaluation show up to a 50% reduction in medium-message latency in simple GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs on the Frontier supercomputer.

Paper Structure (28 sections, 8 figures, 5 tables)

Introduction
GPU Communication Background
Overheads in GPU-Aware Communication Exchanges
Reducing GPU Communication Overheads
Slingshot 11/libfabric GPU Communication Features
Proposed MPI GPU Triggering API
Overview
MPI_Queue: Ordering Progress with External Execution
MPI_Match: Enabling Stream Triggering of Legacy Persistent Operations
Concurrency Management using MPI_Queue
Prototyping Halo Exchanges with the Revised API
Achieving CPU-Free MPI GPU Communication
Send and Receive with GPU Communication Counters
Handling Receiver Readiness Checks
Deadlock and Resource Exhaustion Analysis
...and 13 more sections

Figures (8)

Figure 1: Simplified CPU-driven GPU computation and MPI communication. Most current applications use this CPU-based approach to communicating data to and from GPU computations.
Figure 2: Simplified version of MPI code to initialize a halo gather operation in the StreamHalo object.
Figure 3: Simplified version of MPI code to enqueue a ghost cell gather operation in the StreamHalo object.
Figure 4: Timeline for our stream-triggered implementation of send, ready send, and receive operations with respect to the CPU, GPU, and NIC. Dotted boxes represent deferred work queue operations being carried out; orange boxes represent actions specific to the implementation of regular (non-ready) send, as described in the section "Handling Receiver Readiness Checks".
Figure 5: Cray MPICH and Stream-Triggered GPU packing ping-pong bandwidth and latency between two Frontier nodes.
...and 3 more figures
