On the Impact of Intra-node Communication in the Performance of Supercomputer and Data Center Interconnection Networks

Joaquin Tarraga-Moreno, Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles

TL;DR

This work tackles the bottleneck created by interference between intra-node and inter-node communication in heterogeneous HPC systems and data centers as accelerators proliferate. It introduces an OMNeT++-based model that jointly simulates intra-node PCIe-like networks and inter-node InfiniBand/RDMA-like networks, incorporating realistic LLM training traffic patterns (data parallelism and model parallelism, the latter covering tensor parallelism (TP) and pipeline parallelism (PP)) and an analysis of packetization overhead. Key findings show that higher intra-node bandwidth and more accelerators per node can paradoxically hurt inter-node performance because of header-to-payload overhead and congestion at the NICs, especially when TP spans nodes. The results offer design guidance for balancing intra- and inter-node resources and underscore the importance of overhead-aware modeling for scalable AI workloads in large-scale HPC and data center interconnects.
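The packetization overhead mentioned in the TL;DR can be illustrated with a short sketch. It only shows the general arithmetic of splitting a message into packets that each carry a fixed header; it is not the paper's model. The 24-byte per-packet header and the list of maximum payload sizes are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Illustrative sketch of per-packet header overhead (not the paper's model).
# ASSUMPTION: every packet carries a fixed header of HEADER_BYTES bytes.
# The value 24 is an illustrative placeholder, not a figure from the paper.
HEADER_BYTES = 24


def packets_needed(message_bytes: int, max_payload_bytes: int) -> int:
    """Number of packets required to carry a message (ceiling division)."""
    return -(-message_bytes // max_payload_bytes)


def wire_bytes(message_bytes: int, max_payload_bytes: int,
               header_bytes: int = HEADER_BYTES) -> int:
    """Total bytes placed on the link: payload plus one header per packet."""
    return message_bytes + packets_needed(message_bytes, max_payload_bytes) * header_bytes


def payload_efficiency(payload_bytes: int,
                       header_bytes: int = HEADER_BYTES) -> float:
    """Fraction of a full-size packet that is useful payload."""
    return payload_bytes / (payload_bytes + header_bytes)


if __name__ == "__main__":
    message = 4096  # bytes; hypothetical message size
    # Hypothetical maximum payload sizes, from small to large.
    for max_payload in (128, 256, 512, 4096):
        print(
            f"max payload {max_payload:>5} B: "
            f"{packets_needed(message, max_payload):>3} packets, "
            f"{wire_bytes(message, max_payload):>5} B on the wire, "
            f"efficiency {payload_efficiency(max_payload):.3f}"
        )
```

Under these assumptions, a smaller maximum payload means more packets per message and a larger share of link bandwidth spent on headers. This is the kind of effect an overhead-aware model has to account for when traffic crosses between networks with different packet sizes.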

Abstract

In the last decade, special-purpose computing and storage devices, such as GPUs, TPUs, or high-speed storage, have been incorporated into the server nodes of supercomputers and data centers. The development of high-bandwidth memory (HBM) enabled a much more compact form factor for these devices, thus allowing several of them to be interconnected within a server node, typically through an intra-node interconnection network (e.g., PCIe, NVLink, or Infinity Fabric). These networks allow scaling up the number of special-purpose computing and storage devices per node. In turn, the inter-node network connects thousands of these devices placed in different server nodes of a supercomputer or data center. Unfortunately, the intra- and inter-node networks may become the system's bottleneck due to the increasing communication demand among accelerators imposed by applications such as generative AI. Although current intra-node network designs alleviate this bottleneck by increasing the bandwidth of the intra-node network, we show in this paper that such high intra-node bandwidth may hinder inter-node communication performance when traffic from outside the node arrives at the intra-node devices and interferes with intra-node traffic. To analyze the impact of this interference, we have studied the communication operations of realistic traffic patterns that exploit intra-node communication. We have developed a generic intra- and inter-node simulation model based on OMNeT++ and modeled these communication operations. We have also performed extensive simulation experiments, which confirm that increasing the intra-node network bandwidth and the number of computing devices per node (i.e., accelerators) is counterproductive to inter-node communication performance.

Paper Structure

This paper contains 22 sections, 5 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Node configuration in the CELLIA cluster.
  • Figure 2: PCIe cluster maximum payload size (MPS).
  • Figure 3: Generic intra- and inter-node network architecture.
  • Figure 4: Overall network throughput versus traffic generation load for different intra-node packet MTUs and traffic patterns in a 32-node fat-tree inter-node network connecting 256 accelerators (8 accelerators per node).
  • Figure 5: Packet latency, divided into components, versus traffic load for traffic patterns C4 and C5 with different intra-node MTUs.
  • ...and 11 more figures