Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters

Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani

TL;DR

This work tackles the rising cost and latency of hyperscale LLM training networks by analyzing LLM traffic patterns and showing that high-volume GPU-to-GPU communication is concentrated within high-bandwidth (HB) domains and rails, not spread across all GPU pairs. It introduces Rail-only, a spineless interconnect that preserves full connectivity within each rail via per-rail Clos networks while removing the spine layer, enabling routing with minimal overhead and low fault exposure. The authors provide an analytic iteration-time model, HB-domain sizing guidance, and a cost/power analysis demonstrating 38%–77% network-cost reductions and 37%–75% network power savings, with only 8.2%–11.2% completion-time overhead for MoE all-to-all traffic. The design is shown to be practical for both standard LLMs and MoE variants, which makes it relevant to deploying large-scale, energy-efficient training clusters.

Abstract

This paper presents a low-cost network architecture for training large language models (LLMs) at hyperscale. We study the optimal parallelization strategy of LLMs and propose a novel datacenter network design tailored to LLMs' unique communication patterns. We show that LLM training generates sparse communication patterns in the network and, therefore, does not require an any-to-any full-bisection network to complete efficiently. As a result, our design eliminates the spine layer in traditional GPU clusters. We name this design a Rail-only network and demonstrate that it achieves the same training performance while reducing the network cost by 38% to 77% and network power consumption by 37% to 75% compared to a conventional GPU datacenter. Our architecture also supports Mixture-of-Experts (MoE) models with all-to-all communication through forwarding, with only 8.2% to 11.2% completion time overhead for all-to-all traffic. We study the failure robustness of Rail-only networks and provide insights into the performance impact of different network and training parameters.
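
To make the connectivity argument concrete, the following is a minimal sketch of how a path between two GPUs could be chosen in a spineless, rail-only topology. It is an illustration under stated assumptions, not the authors' implementation: it assumes global ranks are laid out so that `rank = hb_domain * hb_size + rail`, and that traffic between GPUs in different HB domains and different rails is first forwarded inside the source HB domain to the GPU on the destination rail. The names `locate` and `route` and the hop labels `"hb"` and `"rail"` are hypothetical.

```python
from typing import List, Tuple

Hop = Tuple[str, int, int]  # (link type, from rank, to rank)


def locate(rank: int, hb_size: int) -> Tuple[int, int]:
    """Map a global GPU rank to (hb_domain, rail), assuming contiguous ranks per HB domain."""
    return divmod(rank, hb_size)


def route(src: int, dst: int, hb_size: int) -> List[Hop]:
    """Return the hops from src to dst when no spine layer exists."""
    if src == dst:
        return []
    src_domain, src_rail = locate(src, hb_size)
    dst_domain, dst_rail = locate(dst, hb_size)
    if src_domain == dst_domain:
        # Same HB domain: use the high-bandwidth interconnect directly.
        return [("hb", src, dst)]
    if src_rail == dst_rail:
        # Same rail: the per-rail network connects the two GPUs directly.
        return [("rail", src, dst)]
    # Different domain and different rail: forward inside the source HB domain
    # to the GPU that sits on the destination rail, then cross that rail.
    relay = src_domain * hb_size + dst_rail
    return [("hb", src, relay), ("rail", relay, dst)]
```

With `hb_size=256`, `route(0, 256, 256)` uses a single rail hop, since ranks 256 apart share a rail under this layout, while `route(3, 263, 256)` takes one HB hop to rank 7 followed by one rail hop. The extra HB hop is the kind of forwarding the abstract refers to for MoE all-to-all traffic.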

Paper Structure

This paper contains 20 sections, 5 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: A GPU datacenter with Rail-optimized, any-to-any Clos networks [dgxh100archdoc].
  • Figure 2: (a) The traffic volume from different parallelization dimensions; (b) The communication type across all GPU pairs.
  • Figure 3: Traffic heatmaps for GPT-1T in MegatronLM [narayanan2021efficient]. Highlights show GPUs in the same HB domains and rails.
  • Figure 4: Traffic distribution and heatmaps for GPT-1T, distributed on 16 DGX GH200s. Note that DP (NIC) accounts for 0.8% of the total traffic. The "Same-Rail" legend in the traffic-distribution panel applies to GPUs whose ranks are 256 apart.
  • Figure 5: Traffic volume distribution and heatmap for the MoE-1.3B model in DeepSpeedMoE [rajbhandari2022deepspeedmoe], assuming uniform token distribution.
  • ...and 5 more figures