Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation

Patrick G. Bridges, Derek Schafer, Jack Lange, James B. White, Anthony Skjellum, Evan Suggs, Thomas Hines, Purushotham Bangalore, Matthew G. F. Dosanjh, Whit Schonbein

TL;DR

This paper describes the design, implementation, and evaluation of an MPI-based GPU communication API enabling easy-to-use, high-performance, CPU-free communication and demonstrates the utility and performance by showing how the API naturally enables CPU-free gather/scatter halo exchange communication primitives in the Cabana/Kokkos performance portability framework.

Abstract

Removing the CPU from the communication fast path is essential to the performance of GPU-based ML and HPC applications. However, existing GPU communication APIs either continue to rely on the CPU for communication or place significant synchronization burdens on programmers. In this paper we describe the design, implementation, and evaluation of an MPI-based GPU communication API enabling easy-to-use, high-performance, CPU-free communication. This API builds on previously proposed MPI extensions and leverages HPE Slingshot 11 network card capabilities. We demonstrate the utility and performance of the API by showing how it naturally enables CPU-free gather/scatter halo exchange communication primitives in the Cabana/Kokkos performance portability framework, and through a performance comparison with Cray MPICH on the Frontier and Tuolumne supercomputers. Results from this evaluation show up to a 50% reduction in medium message latency in simple GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs of the Frontier supercomputer.

Paper Structure (28 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Simplified CPU-driven GPU computation and MPI communication. Most current applications use this CPU-based approach to communicating data to and from GPU computations.
  • Figure 2: Simplified version of MPI code to initialize a halo gather operation in the StreamHalo object.
  • Figure 3: Simplified version of MPI code to enqueue a ghost cell gather operation in the StreamHalo object.
  • Figure 4: Timeline for our stream-triggered implementation of send, ready send, and receive operations with respect to the CPU, GPU, and NIC. Dotted boxes represent deferred work queue operations being carried out; orange boxes represent actions specific to the implementation of regular (non-ready) send, as described in Section \ref{sec:impl:readiness}.
  • Figure 5: Cray MPICH and Stream-Triggered GPU packing ping-pong bandwidth and latency between two Frontier nodes.
  • ...and 3 more figures