GPU-Initiated Networking for NCCL

Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini, Sylvain Jeaugey, Aamir Shafi, Georgios Theodorakis, Manjunath Gorentla Venkata

TL;DR

This work introduces GPU-Initiated Networking (GIN) as part of NCCL 2.28 to enable direct GPU-driven network operations from CUDA kernels, thereby reducing CPU coordination overhead for fine-grained communication patterns common in MoE and kernel-fusion workloads. GIN implements a three-layer architecture (host NCCL Core APIs, device-side GIN API, and a pluggable network backend layer) with two backends: GDAKI for direct GPU-to-NIC communication and Proxy for CPU-assisted operation on standard RDMA NICs, preserving NCCL’s ecosystem while enabling device-initiated primitives. The authors demonstrate DeepEP integration and provide comprehensive microbenchmarks and application-level results showing competitive latency and bandwidth relative to NVSHMEM transports, validating practicality and scalability across multi-node GPU clusters. The work highlights significant practical impact by enabling tight computation-communication coupling within a unified NCCL runtime, facilitating MoE inference, kernel fusion, and broader production deployment of GPU-initiated networking.

Abstract

Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, in which the CPU orchestrates all communication operations, a characteristic of the CUDA runtime. Although this model is robust for collective operations, applications that require tight integration of computation and communication can benefit from device-initiated communication, which eliminates CPU coordination overhead. NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, and semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) device-side APIs for remote memory operations callable from CUDA kernels; and iii) a network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated, abbreviated GDAKI, and Proxy) for broad hardware support. The GDAKI backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN's practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL's unified runtime, combining low-latency operations with NCCL's collective algorithms and production infrastructure.
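
To make the Proxy backend's "lock-free GPU-to-CPU queues" more concrete, the following is a minimal sketch of a single-producer/single-consumer ring buffer of the kind such a design could use. It is an illustration under stated assumptions, not NCCL's implementation: the names `PutDescriptor` and `SpscQueue`, the descriptor fields, and the power-of-two capacity are all hypothetical, and both ends run as host code here, whereas in a proxy-style design the producer would be GPU code and the consumer a CPU thread that posts the RDMA operation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor for one remote write. The field names are
// illustrative only and are not taken from NCCL.
struct PutDescriptor {
    int peer;                  // destination rank
    std::uint64_t src_offset;  // offset into the local registered window
    std::uint64_t dst_offset;  // offset into the remote registered window
    std::uint64_t bytes;       // payload size in bytes
};

// Single-producer/single-consumer lock-free ring buffer.
// Exactly one thread may call try_push and exactly one may call try_pop.
template <std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side: returns false when the queue is full.
    bool try_push(const PutDescriptor& d) {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint64_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) {
            return false;  // no free slot
        }
        slots_[tail & (Capacity - 1)] = d;
        // Release: the slot contents become visible before the new tail.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Consumer side: returns false when the queue is empty.
    bool try_pop(PutDescriptor& out) {
        const std::uint64_t head = head_.load(std::memory_order_relaxed);
        const std::uint64_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) {
            return false;  // nothing to consume
        }
        out = slots_[head & (Capacity - 1)];
        // Release: the slot is handed back to the producer after the copy.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

private:
    PutDescriptor slots_[Capacity];
    std::atomic<std::uint64_t> head_{0};  // next index to pop
    std::atomic<std::uint64_t> tail_{0};  // next index to push
};
```

In this sketch a producer would call `try_push` with a descriptor for each remote write, and a polling consumer would call `try_pop` and issue the corresponding network operation. Because each index is written by only one side, no locks or compare-and-swap loops are needed.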

Paper Structure

This paper contains 26 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: NCCL Device API architecture showing three operation modes and their underlying interconnect technologies. Load/Store Accessible (LSA) uses PCIe and NVLink for intra-node memory operations, Multimem leverages NVLink SHARP for hardware multicast, and GPU-Initiated Networking (GIN) provides dual backend implementations (GDAKI and Proxy) for network-based communication over InfiniBand and RoCE.
  • Figure 2: High-level architecture comparison: NCCL Device API (left) with three operation modes (Load/Store Accessible for NVLink/PCIe, Multimem for NVLink SHARP, GIN for Network RDMA) enables single-shot collective algorithms with collective symmetric memory, while traditional NCCL (right) uses host-initiated algorithms with pipeline primitives over regular memory.
  • Figure 3: GIN Architecture showing the interactions between NCCL Core, Plugin Layer, and Device-Side API.
  • Figure 4: Point-to-point latency: NVSHMEM IBGDA/IBRC and NCCL GIN GDAKI/Proxy backends.
  • Figure 5: HT kernel bandwidth for NCCL GIN and NVSHMEM.
  • ...and 4 more figures