Routing for Large ML Models

Ofir Cohen; Jose Yallouz; Michael Schapira; Shahar Belkar; Tal Mizrahi

TL;DR

The paper tackles routing for very large ML model training by proposing a centralized controller plus host-enforced routing that optimizes a $2$-layered max-min fairness objective over a Clos network. It introduces a fast greedy route-optimization algorithm, augmented with forward-looking traffic predictions and parallelizable runtimes, and leverages RDMA-aware host behavior to distinguish elephant flows. The authors provide theoretical guarantees (a $2$-approximation) and comprehensive simulations showing near-optimal All-Reduce time and robust performance under failures, with favorable runtimes compared to ILP baselines and alternatives. This approach offers practical, scalable improvements for training efficiency in data-center networks without requiring major topology changes, and it highlights the value of integrating control-plane routing with host-level enforcement in RDMA-rich ML workloads.

Abstract

Training large language models (LLMs), and other large machine learning models, involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training processes exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are routed across the network. We present an algorithmic framework for \textit{quantifying} network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically \textit{optimizing} routing with respect to this global metric.
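
To make the idea of greedy route optimization concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a two-tier leaf-spine Clos network in which every inter-leaf flow must pick exactly one spine, and it uses the maximum link load as a simplified stand-in for the layered max-min fairness objective. The function name `greedy_route`, the flow tuple layout, and the link keys are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def greedy_route(flows, num_spines):
    """Greedily assign each flow to a spine switch (illustrative sketch).

    flows: list of (src_leaf, dst_leaf, demand) tuples, assumed to have
           src_leaf != dst_leaf so every flow crosses one spine.
    Returns (assignment, max_load), where assignment maps the flow index
    to the chosen spine and max_load is the heaviest link load.
    """
    load = defaultdict(float)  # load per directed link
    assignment = {}

    # Place the largest flows first; sorted() is stable, so equal-demand
    # flows keep their original order.
    order = sorted(range(len(flows)), key=lambda i: -flows[i][2])

    for i in order:
        src, dst, demand = flows[i]

        def bottleneck(spine):
            # Load of the busier of the two links this flow would use,
            # after adding the flow's demand.
            uplink = load[("up", src, spine)]
            downlink = load[("down", spine, dst)]
            return max(uplink, downlink) + demand

        # min() returns the first spine among ties.
        best = min(range(num_spines), key=bottleneck)
        load[("up", src, best)] += demand
        load[("down", best, dst)] += demand
        assignment[i] = best

    max_load = max(load.values(), default=0.0)
    return assignment, max_load


if __name__ == "__main__":
    demo_flows = [(0, 1, 10), (0, 1, 10), (0, 2, 5)]
    print(greedy_route(demo_flows, num_spines=2))
```

In this toy instance the two large flows are spread across the two spines, and the smaller flow is then added to whichever spine yields the lowest resulting bottleneck. A real controller would also need to account for link capacities, failures, and the predicted traffic mentioned in the TL;DR, none of which this sketch models.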
