On the Burstiness of Distributed Machine Learning Traffic

Natchanon Luangsomboon; Fahimeh Fazel; Jörg Liebeherr; Ashkan Sobhani; Shichao Guan; Xingjun Chu

TL;DR

This work analyzes the burstiness of traffic generated by distributed ML training, focusing on microbursts in data center networks (DCNs). It introduces network-calculus-based metrics, namely the burstiness curve ${\mathcal{E}}_A(\tau)$, the peak-to-mean ratio ${\rm PtM}(\tau)$, and the backlog measure $B_{\rm max}(r)$, to quantify bursts across time scales. Measurements of ResNet-50 training show extreme short-term burstiness, with peak-to-mean ratios exceeding $60:1$ on intervals as long as 5 ms. Within a single application, coordinated gradient exchanges (server-based Allreduce and Ring Allreduce) keep bursts from different sources from causing congestion, while cross-application fan-in remains a congestion risk. ns-3 simulations of cross-application ML traffic reveal challenges for congestion control schemes such as DCQCN and point to the need for topology-aware designs and alternative gradient-aggregation methods that better handle concurrent bursts. These results provide reference points for future work on DCN topology, accelerators, and gradient aggregation schemes aimed at reducing microbursts in distributed ML workloads.

Abstract

Traffic from distributed training of machine learning (ML) models makes up a large and growing fraction of the traffic mix in enterprise data centers. While work on distributed ML abounds, the network traffic generated by distributed ML has received little attention. Using measurements on a testbed network, we investigate the traffic characteristics generated by the training of the ResNet-50 neural network, with an emphasis on its short-term burstiness. For the latter, we propose metrics that quantify traffic burstiness at different time scales. Our analysis reveals that distributed ML traffic exhibits a very high degree of burstiness on short time scales, exceeding a 60:1 peak-to-mean ratio on time intervals as long as 5 ms. We observe that training software orchestrates transmissions in such a way that burst transmissions from different sources within the same application do not result in congestion and packet losses. An extrapolation of the measurement data to multiple applications underscores the challenges of distributed ML traffic for congestion and flow control algorithms.
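
To make the notion of a peak-to-mean ratio at a given time scale concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the ratio is the largest traffic volume observed in any window of a given length, divided by the volume the long-term mean rate would produce over the same window. The function name `peak_to_mean`, the slotted trace format, and the synthetic trace are hypothetical and are not measurement data from the paper.

```python
from typing import Sequence


def peak_to_mean(bytes_per_slot: Sequence[float], window: int) -> float:
    """Peak-to-mean ratio over sliding windows of `window` consecutive slots.

    Assumed definition (not taken from the paper): the largest volume seen in
    any window of that length, divided by the mean volume per window.
    """
    n = len(bytes_per_slot)
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and the trace length")
    total = sum(bytes_per_slot)
    if total <= 0:
        raise ValueError("trace carries no traffic")

    # Prefix sums turn each window sum into a single subtraction.
    prefix = [0.0]
    for volume in bytes_per_slot:
        prefix.append(prefix[-1] + volume)

    peak = max(prefix[i + window] - prefix[i] for i in range(n - window + 1))
    mean = total * window / n
    return peak / mean


# Synthetic trace: 60 slots, all traffic concentrated in the last slot.
trace = [0] * 59 + [60]
for w in (1, 5, 60):
    print(w, peak_to_mean(trace, w))
```

If each slot is read as 1 ms, `window=5` corresponds to a 5 ms time scale. The ratio approaches 1 as the window grows toward the full trace length, which is why burstiness has to be reported per time scale rather than as a single number.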

Paper Structure (20 sections, 2 theorems, 11 equations, 14 figures, 1 table)

Introduction
Metrics for Traffic Burstiness
Network Calculus Background
Burstiness Metrics
Examples
Distributed ML Experiments
Distributed Training of DNNs
Measurement Testbed
Server-based Training
Traffic of Linear-Allreduce
Burstiness Metrics
Serverless Training
Traffic of Ring-Allreduce
Burstiness Metrics
Burstiness Potential of Distributed ML Traffic
...and 5 more sections

Key Result

Lemma 1

The maximum backlog of traffic with an arrival function $A$ at a network element with exact service curve $S$ satisfies $B_{\rm max} = ({\mathcal{E}}_A \oslash S)(0)$.
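
As an illustration of Lemma 1, the sketch below evaluates the min-plus deconvolution at zero, $\sup_{\tau \ge 0} \{ {\mathcal{E}}_A(\tau) - S(\tau) \}$, on a slotted toy trace and compares it with the peak backlog of a directly simulated queue. It rests on several assumptions that do not come from the paper: time is discrete, ${\mathcal{E}}_A(\tau)$ is taken to be the largest volume arriving in any $\tau$ consecutive slots, and the network element is a constant-rate link with $S(\tau) = r\tau$. All function names and the trace are hypothetical.

```python
from typing import List, Sequence


def burstiness_curve(arrivals: Sequence[float]) -> List[float]:
    """curve[k] = largest volume arriving in any k consecutive slots, k = 0..n."""
    n = len(arrivals)
    prefix = [0.0]
    for a in arrivals:
        prefix.append(prefix[-1] + a)
    return [
        max(prefix[i + k] - prefix[i] for i in range(n - k + 1))
        for k in range(n + 1)
    ]


def max_backlog_from_curve(curve: Sequence[float], rate: float) -> float:
    """Deconvolution at 0 with S(k) = rate * k: max over k of curve[k] - rate * k."""
    return max(e - rate * k for k, e in enumerate(curve))


def max_backlog_by_queue(arrivals: Sequence[float], rate: float) -> float:
    """Peak backlog of a slotted constant-rate queue (Lindley recursion)."""
    backlog = peak = 0.0
    for a in arrivals:
        backlog = max(0.0, backlog + a - rate)
        peak = max(peak, backlog)
    return peak


# Toy trace: a two-slot burst followed by a smaller one; link drains 2 units per slot.
arrivals = [0, 8, 8, 0, 0, 4, 0, 0]
rate = 2.0
curve = burstiness_curve(arrivals)
print(max_backlog_from_curve(curve, rate))  # 12.0
print(max_backlog_by_queue(arrivals, rate))  # 12.0
```

Both computations agree on this trace because, in the slotted model, the backlog at any time is the largest excess of arrivals over service across all intervals ending at that time. Sweeping `rate` over a range of values gives the backlog as a function of the service rate, which is one plausible reading of the measure $B_{\rm max}(r)$ mentioned in the TL;DR.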

Figures (14)

Figure 1: Traffic rates of DCN traffic trace with and without large rate spike.
Figure 4: Workflow of DNN model training.
Figure 5: Testbed network for traffic measurements.
Figure 6: Linear-Allreduce: Traffic from all workers to server (10 s).
Figure 7: Linear-Allreduce: Burstiness metrics for traffic from one worker (Worker3) to the server.
...and 9 more figures

Theorems & Definitions (4)

Lemma 1
proof
Corollary 1
proof
