Nezha: Breaking Multi-Rail Network Barriers for Distributed DNN Training

Enda Yu, Dezun Dong, Xiangke Liao

TL;DR

Nezha targets the communication bottleneck of distributed DNN training on aging multi-rail networks with a protocol-agnostic allreduce system. It combines cross-protocol coordination, a protocol-aware dynamic load-balancing strategy, and a fault-tolerant design so that TCP, SHARP, and GLEX can be used together across heterogeneous rails. On 8-node clusters, Nezha achieves 74 percent and 80 percent higher throughput than MPTCP in homogeneous (TCP-TCP) and heterogeneous (TCP-SHARP) networks, respectively, and on 128-node GPT-like workloads it delivers up to 2.36× higher training efficiency, outperforming baselines such as Gloo, MPTCP, and MRIB. The work shows that systematic multi-rail optimization can unlock the performance potential of legacy infrastructure, enabling scalable DNN training without hardware upgrades.

Abstract

In distributed deep learning, communication remains a critical bottleneck. While modern hardware advances rapidly, over 60 percent of production HPC systems still rely on legacy infrastructure (V100 GPUs, multi-plane Ethernet/InfiniBand), necessitating communication optimization without hardware upgrades. Existing approaches face three key limitations: (1) static single-rail binding underutilizes multi-rail bandwidth, (2) protocol heterogeneity (TCP-RDMA coexistence) causes synchronization delays, and (3) mainstream libraries (NCCL/MPI) lack cross-protocol coordination. We present Nezha, the first protocol-agnostic system for multi-rail networks. Our contributions include: (1) Hardware-agnostic cross-protocol coordination: a unified abstraction enabling seamless collaboration between in-network computing (SHARP), adaptive RDMA (GLEX), and TCP, achieving 1.7 to 4.3 times lower latency than Gloo. (2) Protocol-aware dynamic load balancing: a hybrid scheduling strategy with a cold/hot start state machine for heterogeneous protocols, reducing startup latency for small payloads while enhancing throughput for large transfers. (3) Fault-tolerant multi-rail collaboration: a self-recovery mechanism that reroutes data flows within 200 milliseconds upon single-rail failures, ensuring uninterrupted training. Experiments on 8-node clusters demonstrate that Nezha achieves 74 percent and 80 percent higher throughput than MPTCP in homogeneous (TCP-TCP) and heterogeneous (TCP-SHARP) networks, respectively. On 128-node supercomputers, Nezha delivers 2.36 times higher training efficiency than Gloo. By bridging modern DNN communication demands with legacy infrastructure, Nezha proves that systematic multi-rail optimization can unlock the potential of aging clusters.
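
To make the idea of multi-rail load balancing with failover more concrete, here is a minimal sketch in Python. It is not Nezha's scheduler: the abstract does not give the algorithm, so this example simply assumes that a payload is split across healthy rails in proportion to a measured bandwidth, and that a failed rail's share is redistributed to the remaining rails. The names `Rail` and `split_payload`, and the bandwidth figures in the usage note, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Rail:
    """One network rail; all fields are illustrative assumptions."""
    name: str              # label only, e.g. "tcp", "sharp", "glex"
    bandwidth_gbps: float  # assumed measured throughput of this rail
    healthy: bool = True   # set to False when the rail is detected as failed


def split_payload(total_bytes: int, rails: list[Rail]) -> dict[str, int]:
    """Split a payload across healthy rails in proportion to bandwidth.

    Assumed policy for illustration, not the paper's method.
    """
    live = [r for r in rails if r.healthy]
    if not live:
        raise RuntimeError("no healthy rail available")

    total_bw = sum(r.bandwidth_gbps for r in live)
    shares = {
        r.name: int(total_bytes * r.bandwidth_gbps / total_bw) for r in live
    }

    # Integer truncation can leave a few bytes unassigned; give them to the
    # fastest rail so the shares always sum to total_bytes.
    fastest = max(live, key=lambda r: r.bandwidth_gbps)
    shares[fastest.name] += total_bytes - sum(shares.values())
    return shares
```

With two rails of assumed bandwidth 10 and 30 Gbps, `split_payload(1000, rails)` assigns 250 bytes to the slower rail and 750 to the faster one. Marking the faster rail as unhealthy and calling the function again sends all 1000 bytes over the remaining rail, which mirrors the rerouting behavior the abstract describes at a very coarse level.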

Paper Structure

This paper contains 38 sections, 8 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Multi-rail network architectures.
  • Figure 2: Latency and throughput characteristics of GLEX, TCP, and SHARP protocols in allreduce operations across varying data sizes.
  • Figure 3: The impact of real-time efficiency ratio on the throughput improvement ratio of the optimal network.
  • Figure 4: Throughput of allreduce on various single-rail networks bound with different CPU cores.
  • Figure 5: The system architecture of Nezha. Context, Trans., Op., and Ctrl represent the Context, Transport, Collective Operations, and Control modules respectively.
  • ...and 14 more figures