MLTCP: Congestion Control for DNN Training

Sudarsanan Rajasekaran; Sanjoli Narang; Anton A. Zabreyko; Manya Ghobadi

TL;DR

MLTCP addresses cross-job network contention in shared GPU clusters during DNN training by enabling inter-job communication interleaving through a lightweight augmentation to existing congestion control algorithms. It biases per-flow aggressiveness with a bandwidth function $F(\text{bytes\_ratio})$, where $\text{bytes\_ratio} = \frac{\text{bytes\_sent}}{\text{total\_bytes}}$, to approximate SRPT scheduling and drive flows toward a stable interleaved state. The design introduces a job favoritism policy and a linear aggressiveness function that can be applied to Reno, CUBIC, and DCQCN with only about 30–60 lines of code per algorithm. It is robust to stragglers and partial compatibility, and it speeds up the average and 99th-percentile training iteration times by up to 2x and 4x, respectively. The practical impact is a distributed, scalable solution that enables automatic interleaving in real clusters without centralized control, with a kernel-level implementation and publicly available evaluation scripts.

Abstract

We present MLTCP, a technique to augment today's congestion control algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication phases of jobs that compete for network bandwidth to interleave with each other, thereby utilizing the network efficiently. At the heart of MLTCP lies a very simple principle based on a key conceptual insight: DNN training flows should scale their congestion window size based on the number of bytes sent at each training iteration. We show that integrating this principle into today's congestion control protocols is straightforward: by adding 30-60 lines of code to Reno, CUBIC, or DCQCN, MLTCP stabilizes flows of different jobs into an interleaved state within a few training iterations, regardless of the number of competing flows or the start time of each flow. Our experiments with popular DNN training jobs demonstrate that enabling MLTCP accelerates the average and 99th percentile training iteration time by up to 2x and 4x, respectively.
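The principle above can be illustrated with a minimal Python sketch of a Reno-style flow whose additive increase is scaled by a linear function of bytes_ratio. This is only an illustration under stated assumptions, not the authors' kernel implementation: the class name `MltcpRenoFlow`, the `slope` and `intercept` values, the segment size, and the exact form of the window update are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class MltcpRenoFlow:
    """Toy Reno-like flow with a linear bandwidth aggressiveness function.

    All names and default values are hypothetical, chosen for illustration.
    """
    total_bytes: int          # bytes this job sends per training iteration
    bytes_sent: int = 0       # bytes acknowledged so far in this iteration
    cwnd: float = 10.0        # congestion window, in segments
    mss: int = 1500           # assumed segment size in bytes
    slope: float = 1.0        # assumed slope of the linear function
    intercept: float = 0.5    # assumed intercept of the linear function

    def bytes_ratio(self) -> float:
        # Fraction of the iteration's communication already completed.
        return min(self.bytes_sent / self.total_bytes, 1.0)

    def aggressiveness(self) -> float:
        # Linear F(bytes_ratio): flows closer to finishing are favored,
        # which loosely approximates shortest-remaining-processing-time.
        return self.slope * self.bytes_ratio() + self.intercept

    def on_ack(self, acked_segments: int = 1) -> None:
        self.bytes_sent += acked_segments * self.mss
        # Plain Reno congestion avoidance adds acked/cwnd per ACK;
        # here that increment is scaled by F(bytes_ratio).
        self.cwnd += self.aggressiveness() * acked_segments / self.cwnd

    def on_loss(self) -> None:
        # Standard multiplicative decrease, left unchanged in this sketch.
        self.cwnd = max(self.cwnd / 2.0, 1.0)

    def start_iteration(self) -> None:
        # bytes_sent is tracked per training iteration, so reset it.
        self.bytes_sent = 0


# Two flows with equal windows: one just started, one 90% through its iteration.
early = MltcpRenoFlow(total_bytes=1_500_000)
late = MltcpRenoFlow(total_bytes=1_500_000, bytes_sent=1_350_000)
early.on_ack()
late.on_ack()
```

After one ACK each, `late.cwnd` exceeds `early.cwnd`, so the flow nearer to completing its communication phase claims more bandwidth. In this sketch, that asymmetry is what nudges competing jobs away from overlapping and toward interleaving.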

Paper Structure (22 sections, 15 equations, 16 figures, 1 algorithm)

Introduction
Communication Interleaving
Background
Challenges of Comm. Interleaving
MLTCP Design
Goals and High-Level Concept
Job Favoritism Policy
Bandwidth Aggressiveness Function
Augmenting Reno, CUBIC, & DCQCN
Updating MLTCP Parameters
Evaluation
Evaluation Methodology
Convergence Benchmarks
Training Iteration Time Speedup
Impact of DNN Model & Parallelization Strategy Diversity
...and 7 more sections

Figures (16)

Figure 1: Inter-job communication interleaving. Figure adapted from [muri, cassini_hotnets, cassini_nsdi].
Figure 2: Circular dependency between jobs and links.
Figure 3: MLTCP high-level concept.
Figure 4: (a) Favoring Job$_{1}$ interleaves the jobs. (b) Favoring Job$_{2}$ overlaps the jobs.
Figure 5: Potential bandwidth aggressiveness functions for adjusting the cwnd (or rate).
...and 11 more figures
