NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

Amos Goldman; Nimrod Boker; Maayan Sheraizin; Nimrod Admoni; Artem Polyakov; Subhadeep Bhattacharya; Fan Yu; Kai Sun; Georgios Theodorakis; Hsin-Chun Yin; Peter-Jan Gootzen; Aamir Shafi; Assaf Ravid; Salvatore Di Girolamo; Manjunath Gorentla Venkata; Gil Bloch

Abstract

Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP and Hybrid-EP. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, and supports two modes: Low-Latency (LL) for inference decoding and High-Throughput (HT) for training and inference prefill. LL targets small batch sizes (1-128 tokens) and uses direct all-to-all RDMA+NVLink mesh connectivity, with double-buffered communication so that dispatch and combine phases can overlap. HT targets large batches (4096+ tokens) and uses hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes use the Device API for intra-node and inter-node communication, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.
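
To make the dispatch and combine terminology in the abstract concrete, the following is a minimal single-process sketch in plain Python. It is not the NCCL EP API: the function names `dispatch` and `combine`, their arguments, and the toy expert function are hypothetical, and the sketch assumes the common MoE convention that combine is a router-weighted sum of expert outputs. It ignores GPUs, RDMA, and NVLink entirely and only shows the data movement the two primitives are responsible for.

```python
from collections import defaultdict

def dispatch(tokens, topk_ids):
    """Group a copy of each token under every expert it is routed to.

    tokens:   list of token vectors (lists of floats)
    topk_ids: per-token list of destination expert ids
    Returns {expert_id: [(source_token_index, vector), ...]}.
    """
    per_expert = defaultdict(list)
    for t, (vec, experts) in enumerate(zip(tokens, topk_ids)):
        for e in experts:
            per_expert[e].append((t, list(vec)))
    return dict(per_expert)

def combine(expert_outputs, topk_ids, topk_weights, num_tokens, hidden):
    """Return expert outputs to their source tokens as a weighted sum.

    Assumption: each token's result is sum_k weight_k * expert_k(token).
    """
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for e, items in expert_outputs.items():
        for t, vec in items:
            w = topk_weights[t][topk_ids[t].index(e)]
            for h in range(hidden):
                out[t][h] += w * vec[h]
    return out

def toy_expert(expert_id, vec):
    """Hypothetical stand-in for an expert FFN: scale by (expert_id + 1)."""
    return [(expert_id + 1) * x for x in vec]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
topk_ids = [[0, 2], [1, 2], [0, 1]]                  # top-2 routing
topk_weights = [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]

routed = dispatch(tokens, topk_ids)
processed = {
    e: [(t, toy_expert(e, vec)) for t, vec in items]
    for e, items in routed.items()
}
result = combine(processed, topk_ids, topk_weights, len(tokens), 2)
```

In a real expert-parallel deployment the per-expert groups live on different GPUs, so `dispatch` and `combine` each become an all-to-all exchange; the LL and HT modes described above differ in how that exchange is carried out, not in this logical result.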

Paper Structure (53 sections, 3 equations, 8 figures, 7 tables)

Introduction
Contributions
Background
Transformer architecture
MoE Communication Patterns
NCCL Device API
The MoE Communication Landscape
NCCL EP: Design and API
Design Philosophy
Core Operations
Resource Management
MoE Group
MoE Handle
Algorithm Modes
Tensor Abstraction
...and 38 more sections

Figures (8)

Figure 1: NCCL EP architecture: low-latency kernels adapted from DeepEP [deepep2025] and high-throughput kernels adapted from Hybrid-EP [hybridep] use NCCL GIN (GPU-Initiated Networking) for inter-node RDMA, while retaining their native NVLink implementations for intra-node communication.
Figure 2: NCCL EP execution flow: Group creation (once), followed by repeated Handle → Dispatch → Expert FFN → Combine cycles. Staged mode splits dispatch and combine into send and receive phases for overlap.
Figure 3: LL mode: 2D input to 3D expert-major output.
Figure 4: HT mode: 2D input to 2D concatenated output.
Figure 5: MoE training flow: handles are created before dispatch and shared between forward and backward passes to maintain routing state.
...and 3 more figures
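
The captions for Figures 3 and 4 name two output layouts without showing them. The sketch below illustrates one plausible reading, in plain Python with top-1 routing for brevity. Everything here is an assumption rather than the library's actual tensor format: the helper names, the fixed per-expert `capacity` with zero padding for the 3D case, and the per-expert counts returned alongside the 2D case are all hypothetical.

```python
def to_expert_major_3d(tokens, expert_ids, num_experts, capacity, pad=0.0):
    """LL-style layout (assumed): [num_experts][capacity][hidden].

    Each expert owns a fixed-size slab; unused slots stay padded.
    """
    hidden = len(tokens[0])
    out = [[[pad] * hidden for _ in range(capacity)] for _ in range(num_experts)]
    counts = [0] * num_experts
    for vec, e in zip(tokens, expert_ids):
        if counts[e] >= capacity:
            raise ValueError(f"expert {e} exceeded capacity {capacity}")
        out[e][counts[e]] = list(vec)
        counts[e] += 1
    return out, counts

def to_concatenated_2d(tokens, expert_ids, num_experts):
    """HT-style layout (assumed): [total_tokens][hidden], grouped by expert.

    No padding; per-expert counts tell the consumer where each group ends.
    """
    buckets = [[] for _ in range(num_experts)]
    for vec, e in zip(tokens, expert_ids):
        buckets[e].append(list(vec))
    out = [vec for bucket in buckets for vec in bucket]
    counts = [len(bucket) for bucket in buckets]
    return out, counts

tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
expert_ids = [1, 0, 1]

ll_out, ll_counts = to_expert_major_3d(tokens, expert_ids, num_experts=2, capacity=2)
ht_out, ht_counts = to_concatenated_2d(tokens, expert_ids, num_experts=2)
```

Under these assumptions the trade-off is that the padded 3D layout gives every expert a fixed, predictable address range, which suits small decode batches, while the concatenated 2D layout avoids moving padding, which matters more for the large batches HT mode targets.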
