
FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters

Hasibul Jamil, Abdul Alim, Laurent Schares, Pavlos Maniotis, Liran Schour, Ali Sydney, Abdullah Kayi, Tevfik Kosar, Bengi Karacali

TL;DR

FlowTracer provides detailed insights into traffic distribution and helps identify the root causes of performance degradation, such as hash collisions. With these insights, system operators can optimize routing, reduce congestion, and improve the performance of distributed AI workloads.

Abstract

The increasing complexity of AI workloads, especially distributed Large Language Model (LLM) training, places significant strain on the networking infrastructure of parallel data centers and supercomputing systems. While Equal-Cost Multi-Path (ECMP) routing distributes traffic over parallel paths, hash collisions often lead to imbalanced network resource utilization and performance bottlenecks. This paper presents FlowTracer, a tool designed to analyze network path utilization and evaluate different routing strategies. FlowTracer aids in debugging network inefficiencies by providing detailed visibility into traffic distribution and helping to identify the root causes of performance degradation, such as issues caused by hash collisions. By offering flow-level insights, FlowTracer enables system operators to optimize routing, reduce congestion, and improve the performance of distributed AI workloads. We use a RoCEv2-enabled cluster with a leaf-spine network and 16 400-Gbps nodes to demonstrate how FlowTracer can be used to compare the flow imbalances of ECMP routing against a statically configured network. The example showcases a 30% reduction in imbalance, as measured by a new metric we introduce.
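
To make the hash-collision effect concrete, the following is a minimal sketch, not FlowTracer itself. It hashes synthetic RoCEv2 flows (UDP destination port 4791) onto a set of equal-cost links and compares the per-link flow counts with a static round-robin assignment. The CRC32 hash, the address layout, the link count of 16, and the max-minus-min "spread" are all assumptions chosen for illustration; in particular, the spread is not the imbalance metric introduced in the paper.

```python
import zlib
from collections import Counter

NUM_FLOWS = 256   # matches the 256 RoCE flows mentioned in the figures
NUM_LINKS = 16    # assumed number of equal-cost links (illustrative)

def make_flows(num_flows):
    """Build synthetic 5-tuples: (src IP, dst IP, protocol, src port, dst port)."""
    flows = []
    for i in range(num_flows):
        src = "10.0.0.%d" % (i % 8 + 1)
        dst = "10.0.1.%d" % ((i // 8) % 8 + 1)
        # Protocol 17 is UDP; RoCEv2 uses UDP destination port 4791.
        flows.append((src, dst, 17, 49152 + i, 4791))
    return flows

def ecmp_link(flow, num_links):
    """Pick a link by hashing the 5-tuple. CRC32 is a stand-in hash,
    not the function used by any particular switch."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_links

def ecmp_counts(flows, num_links):
    """Count how many flows the hash places on each link."""
    counts = Counter(ecmp_link(f, num_links) for f in flows)
    return [counts.get(link, 0) for link in range(num_links)]

def static_counts(flows, num_links):
    """Count flows per link under a preprogrammed round-robin assignment."""
    counts = [0] * num_links
    for i, _ in enumerate(flows):
        counts[i % num_links] += 1
    return counts

def spread(counts):
    """Illustrative imbalance measure: busiest link minus idlest link."""
    return max(counts) - min(counts)

flows = make_flows(NUM_FLOWS)
hashed = ecmp_counts(flows, NUM_LINKS)
static = static_counts(flows, NUM_LINKS)

print("ideal flows per link:", NUM_FLOWS / NUM_LINKS)
print("ECMP   per-link counts:", hashed, "spread:", spread(hashed))
print("static per-link counts:", static, "spread:", spread(static))
```

The static assignment places exactly 16 flows on every link, whereas the hashed assignment depends on how the hash happens to map the 5-tuples, which is the kind of per-link view a path-tracing tool would expose.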

Paper Structure

This paper contains 14 sections, 1 equation, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: FlowTracer's architecture and the process of hop-by-hop path discovery for all flows.
  • Figure 2: (a) Experimental 2-rack testbed consisting of 16 servers and 8 switches. Each server has two dual-port 100-Gb/s NICs, i.e., a total bandwidth of 400 Gb/s per server. The network contains four spine and four leaf switches, with a 1.6-Tb/s cross-rack bandwidth between the leaf and spine layers. (b) Bipartite traffic pattern between servers in the two racks, as used in our evaluation.
  • Figure 3: (a) RoCE throughput distribution across all node pairs and their corresponding imbalance metric shown in red (lower is better). (b) flow distributions of 256 RoCE flows for standard ECMP-based routing. (c) flow distributions for preprogrammed static routing. The red line in each subfigure represents the ideal flow distribution, i.e., all flows are balanced perfectly. Standard ECMP routing results in a noticeable load imbalance, while static routing provides much more balanced distributions.
  • Figure 4: Comparison of completion time versus the number of flows for different numbers of parallel threads (2, 4, and 8 threads).
  • Figure 5: Completion time versus number of flows for three different SSH connection approaches with remote devices: Baseline, Persistent, and Parallel+Persistent. The Baseline configuration represents ad-hoc SSH connections that are immediately terminated after the path discovery. The Persistent configuration uses a single SSH connection for multiple path discoveries, leading to reduced completion times compared to Baseline. The Parallel+Persistent configuration leverages multiple persistent SSH connections managed by parallel threads, resulting in the fastest completion times.
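
The Parallel+Persistent approach described for Figure 5 can be sketched as follows. This is a hedged illustration of the idea only: `FakeSwitchSession`, `discover_on_switch`, and `discover_all` are hypothetical names, the session class stands in for a real SSH connection, and the returned next hop is a placeholder rather than an actual switch query. Each worker thread opens one session per switch and reuses it for every flow routed through that switch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class FakeSwitchSession:
    """Hypothetical stand-in for a persistent SSH session to one switch."""
    opened = 0                  # total sessions created, for illustration
    _lock = threading.Lock()

    def __init__(self, switch):
        self.switch = switch
        with FakeSwitchSession._lock:
            FakeSwitchSession.opened += 1

    def next_hop(self, flow):
        # A real tool would query the switch here; this returns a placeholder.
        return "%s->egress-for-%s" % (self.switch, flow)

def discover_on_switch(switch, flows):
    """Open one session and reuse it for all flows (the persistent part)."""
    session = FakeSwitchSession(switch)
    return [(flow, session.next_hop(flow)) for flow in flows]

def discover_all(flows_by_switch, workers=4):
    """Query several switches concurrently (the parallel part)."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(discover_on_switch, switch, flows)
                   for switch, flows in flows_by_switch.items()]
        for future in futures:
            results.extend(future.result())
    return results

# Hypothetical input: 4 leaf switches, 8 flows entering the network at each.
flows_by_switch = {
    "leaf%d" % s: ["flow%d" % (s * 8 + i) for i in range(8)]
    for s in range(4)
}
results = discover_all(flows_by_switch)
print(len(results), "lookups using", FakeSwitchSession.opened, "sessions")
```

In this sketch the number of sessions equals the number of switches rather than the number of flows. A Baseline-style approach would instead open and tear down one connection per lookup.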