The Landscape of GPU-Centric Communication

Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov

TL;DR

This paper surveys the transition from CPU-dominated to GPU-centric multi-GPU communication, detailing the hardware and software mechanisms that enable GPUs to initiate and manage data transfers. It classifies intra-node and inter-node approaches, reviews vendor technologies (GPUDirect, UVA, UVM, IPC, NIC interactions, and interconnects), and analyzes major libraries (GPU-aware MPI, NCCL/RCCL, NVSHMEM, ROC_SHMEM, and UCX-based flows). The work highlights benefits, limitations, performance considerations, and open questions, emphasizing CPU-free networking, GPU-triggered communication, and the need for advanced debugging and profiling tools. It gives researchers and practitioners a consolidated view for designing, implementing, and optimizing GPU-centric communication across software and hardware stacks, with the aim of maximizing multi-GPU scalability and efficiency.

Abstract

In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and high memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and per cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches in multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. It then explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers with insights into how best to exploit multi-GPU systems.

Paper Structure

This paper contains 36 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Types of intra-node communication methods
  • Figure 1: Data paths and API calls of intra-node communication methods
  • Figure 2: Types of inter-node communication methods. Cells in bold refer to where a change or optimization has been made. D/H means that both device-side and host-side API calls may belong to this type.
  • Figure 2: Inter-node communication data and control paths.
  • Figure 3: Timeline of NVIDIA technologies enabling GPU-centric communication and networking.