Table of Contents
Fetching ...

An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters

Ziteng Chen, Xiaohe Hu, Menghao Zhang, Yanmin Jia, Yan Zhang, Mingjun Zhang, Da Liu, Fangzheng Jiao, Jun Chen, He Liu, Aohan Zeng, Shuaixing Duan, Ruya Gu, Yang Jing, Bowen Han, Jiahao Cao, Wei Chen, Wenqi Xie, Jinlong Hou, Yuan Cheng, Bohua Xu, Mingwei Xu, Chunming Hu

TL;DR

ICCL tackles three core NCCL limitations in large-scale GPU training clusters: excessive SM consumption from non-reduction P2P, vulnerability to RNIC port failures, and lack of fine-grained observability. It achieves these goals with a DPDK-like P2P that offloads P2P from GPU kernels to CPU threads and enables zero-copy memory access, a primary-backup QP mechanism for NIC port failure tolerance, and a window-based RDMA observability monitor with $O(\mu s)$ granularity. Empirical results on production-like clusters show sizable gains: ~23.4% higher P2P throughput, ~28.5% lower P2P latency, and ~6.02% higher training throughput, plus robust operation under RNIC failures (76.6% CC throughput preserved) and detailed runtime observability. The authors also share real-world operating experiences and practical guidance for deploying production-level CC libraries in LLM training, emphasizing resource health, topology-aware channel construction, and efficient monitoring.

Abstract

Large-scale LLM training requires collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we encounter several challenges when using NCCL in production, including 1) limited efficiency with costly and cumbersome P2P communication, 2) poor tolerance to frequent RNIC port failures, and 3) insufficient observability of transient collective communication anomalies. To address these issues, we propose ICCL, an efficient, reliable, and observable collective communication library in large-scale GPU training clusters. ICCL offloads the P2P communication from GPU kernels to CPU threads for minimal SM consumption, and removes the redundant memory copies irrelevant to the actual communicating process. ICCL also introduces a primary-backup QP mechanism to tolerate frequent NIC port failures, and designs a window-based monitor to observe network anomalies at O(us) level. We open-source ICCL and deploy it in production training clusters for several months, with results showing that compared to NCCL, ICCL achieves a 23.4%/28.5% improvement in P2P throughput/latency as well as a 6.02% increase in training throughput. We also share the operating experience of ICCL in large-scale clusters, hoping to give the communities more insights on production-level collective communication libraries in LLM training.

An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters

TL;DR

ICCL tackles three core NCCL limitations in large-scale GPU training clusters: excessive SM consumption from non-reduction P2P, vulnerability to RNIC port failures, and lack of fine-grained observability. It achieves these goals with a DPDK-like P2P that offloads P2P from GPU kernels to CPU threads and enables zero-copy memory access, a primary-backup QP mechanism for NIC port failure tolerance, and a window-based RDMA observability monitor with granularity. Empirical results on production-like clusters show sizable gains: ~23.4% higher P2P throughput, ~28.5% lower P2P latency, and ~6.02% higher training throughput, plus robust operation under RNIC failures (76.6% CC throughput preserved) and detailed runtime observability. The authors also share real-world operating experiences and practical guidance for deploying production-level CC libraries in LLM training, emphasizing resource health, topology-aware channel construction, and efficient monitoring.

Abstract

Large-scale LLM training requires collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we encounter several challenges when using NCCL in production, including 1) limited efficiency with costly and cumbersome P2P communication, 2) poor tolerance to frequent RNIC port failures, and 3) insufficient observability of transient collective communication anomalies. To address these issues, we propose ICCL, an efficient, reliable, and observable collective communication library in large-scale GPU training clusters. ICCL offloads the P2P communication from GPU kernels to CPU threads for minimal SM consumption, and removes the redundant memory copies irrelevant to the actual communicating process. ICCL also introduces a primary-backup QP mechanism to tolerate frequent NIC port failures, and designs a window-based monitor to observe network anomalies at O(us) level. We open-source ICCL and deploy it in production training clusters for several months, with results showing that compared to NCCL, ICCL achieves a 23.4%/28.5% improvement in P2P throughput/latency as well as a 6.02% increase in training throughput. We also share the operating experience of ICCL in large-scale clusters, hoping to give the communities more insights on production-level collective communication libraries in LLM training.

Paper Structure

This paper contains 20 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: NCCL P2P w/ kernels
  • Figure 2: Overhead breakdown of NCCL P2P.
  • Figure 3: Failure statistics in 2024.
  • Figure 4: ICCL overview.
  • Figure 5: DPDK-like P2P in ICCL.
  • ...and 11 more figures