GPU-centric Communication Schemes for HPC and ML Applications

Naveen Namashivayam

TL;DR

This survey addresses the inter-process communication bottleneck in large-scale heterogeneous HPC/ML systems by reviewing GPU-centric schemes that move the communication control path from the CPU to the GPU. It covers three main approaches, Stream Triggered (ST), Kernel Triggered (KT), and Kernel Initiated (KI), and discusses the hardware features (GPUDirect, Async, RDMA, ODP) and software libraries needed to implement them. It shows how these schemes can improve compute/communication overlap, reduce latency, and give the GPU greater autonomy, with analyses of system layouts, implementation requirements, and applicable communication patterns. For both HPC and ML workloads, the findings can guide design choices for future interconnects, NIC offloads, and middleware integration that enable scalable GPU-driven data movement.

Abstract

Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is a technique for effectively utilizing these systems to execute scalable simulation and deep learning workloads. The inter-process communication that results from the distributed execution of these parallel workloads is one of the key factors contributing to their performance bottlenecks. Most programming models and runtime systems that serve the communication requirements on these systems support GPU-aware communication schemes, which move GPU-attached communication buffers in the application directly from the GPU to the NIC without staging through host memory. Even with such GPU-awareness, a CPU thread is still required to orchestrate the communication operations. This survey discusses the available GPU-centric communication schemes that move the control path of the communication operations from the CPU to the GPU. It presents the need for these new communication schemes, the GPU and NIC capabilities required to implement them, and the potential use cases they address. Based on these discussions, it then outlines the challenges involved in supporting the presented GPU-centric communication schemes.

Paper Structure

This paper contains 42 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Representing a traditional HPC heterogeneous system architecture with four compute nodes connected across a network. The heterogeneous compute nodes represent a host CPU attached to two GPU devices. Eight tasks are created for the distributed application, each placed on the same compute node.
  • Figure 2: Representing a simple data movement from a source to target process in a scale-out network.
  • Figure 3: Representing non-GPU-aware communication scheme.
  • Figure 4: Representing GPU-aware communication scheme.
  • Figure 5: Representing a GPU-aware application using ST.
  • ...and 3 more figures