Routing for Large ML Models

Ofir Cohen, Jose Yallouz, Michael Schapira, Shahar Belkar, Tal Mizrahi

TL;DR

The paper tackles routing for very large ML model training by proposing a centralized controller plus host-enforced routing that optimizes a $2$-layered max-min fairness objective over a Clos network. It introduces a fast greedy route-optimization algorithm, augmented with forward-looking traffic predictions and parallelizable runtimes, and leverages RDMA-aware host behavior to distinguish elephant flows. The authors provide theoretical guarantees (a $2$-approximation) and comprehensive simulations showing near-optimal All-Reduce time and robust performance under failures, with favorable runtimes compared to ILP baselines and alternatives. This approach offers practical, scalable improvements for training efficiency in data-center networks without requiring major topology changes, and it highlights the value of integrating control-plane routing with host-level enforcement in RDMA-rich ML workloads.

Abstract

Training large language models (LLMs), and other large machine learning models, involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training processes exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are routed across the network. We present an algorithmic framework for quantifying network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically optimizing routing with respect to this global metric.

Paper Structure

This paper contains 30 sections, 2 theorems, 1 equation, 9 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

$Greedy$ provides a $2$-approximation to the $2$-layered max-min fairness objective in directed $2$-layer Clos networks.
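
To give a feel for what greedy route selection in a 2-layer Clos network can look like, here is a minimal sketch. It is not the paper's Greedy algorithm, whose details are not reproduced on this page, and it targets peak link load rather than the $2$-layered max-min fairness objective. The function name `greedy_assign`, the flow tuples `(src_leaf, dst_leaf, demand)`, and the largest-first ordering are all assumptions made for illustration.

```python
from collections import defaultdict


def greedy_assign(flows, num_spines):
    """Illustrative heuristic (hypothetical, not the paper's Greedy).

    Each flow (src_leaf, dst_leaf, demand) is routed through one spine.
    A flow uses two directed links: src_leaf -> spine and spine -> dst_leaf.
    We pick the spine whose busier link on that path is least loaded.
    """
    up = defaultdict(float)    # (leaf, spine) -> load on the uplink
    down = defaultdict(float)  # (spine, leaf) -> load on the downlink
    assignment = []

    # Assumption: placing larger flows first tends to balance better.
    for src, dst, demand in sorted(flows, key=lambda f: -f[2]):
        best = min(
            range(num_spines),
            # Ties are broken by the lower spine index.
            key=lambda s: (max(up[(src, s)], down[(s, dst)]) + demand, s),
        )
        up[(src, best)] += demand
        down[(best, dst)] += demand
        assignment.append(((src, dst, demand), best))

    max_load = max(list(up.values()) + list(down.values()), default=0.0)
    return assignment, max_load


flows = [(0, 1, 4.0), (0, 1, 3.0), (0, 2, 2.0), (1, 2, 1.0)]
assignment, max_load = greedy_assign(flows, num_spines=2)
for flow, spine in assignment:
    print(flow, "-> spine", spine)
print("max link load:", max_load)
```

A centralized controller of the kind described in the TL;DR would rerun an optimization like this periodically and push the resulting path choices to the hosts; the sketch covers only a single assignment pass.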

Figures (9)

  • Figure 1: Example of Ring All-Reduce with 4 nodes
  • Figure 2: Simple 2-Layer Clos network example
  • Figure 3: Hosts send RDMA buffer occupancy information and the controller sends path assignments to hosts/GPUs.
  • Figure 4: All-Reduce (parameter synchronization) time of Bloom (workshop2023bloom). In the panels for ring sizes 2 through 8, we vary the number of concurrent jobs submitted to the cluster on the x-axis; in the panels for 1 through 5 jobs, we vary the All-Reduce ring size on the x-axis.
  • Figure 5: Flow Completion Time (FCT) of various architectures. In the panels from Bloom with ring size 2 through GPT-3 with ring size 8, we vary the number of concurrent jobs submitted to the cluster on the x-axis.
  • ...and 4 more figures
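
The Ring All-Reduce pattern referenced in Figure 1 can be illustrated with a small single-process simulation. This is a sketch of the standard textbook algorithm (a reduce-scatter phase followed by an all-gather phase), not code from the paper; the function name `ring_all_reduce` and the simplification that each node's vector has exactly one element per node are assumptions.

```python
def ring_all_reduce(vectors):
    """Simulate Ring All-Reduce (sum) over n nodes.

    Assumption for simplicity: each node holds a vector of exactly n
    elements, so chunk c is just element c. Node i only ever sends to
    its ring successor (i + 1) % n.
    """
    n = len(vectors)
    bufs = [list(v) for v in vectors]

    # Phase 1: reduce-scatter. In step t, node i sends chunk (i - t) % n
    # and its successor adds it to its own copy. After n - 1 steps, node i
    # holds the fully reduced chunk (i + 1) % n.
    for t in range(n - 1):
        sends = [(i, (i - t) % n, bufs[i][(i - t) % n]) for i in range(n)]
        for i, chunk, value in sends:
            bufs[(i + 1) % n][chunk] += value

    # Phase 2: all-gather. In step t, node i forwards the reduced chunk
    # (i + 1 - t) % n and its successor overwrites its own copy.
    for t in range(n - 1):
        sends = [(i, (i + 1 - t) % n, bufs[i][(i + 1 - t) % n]) for i in range(n)]
        for i, chunk, value in sends:
            bufs[(i + 1) % n][chunk] = value

    return bufs


nodes = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
for i, buf in enumerate(ring_all_reduce(nodes)):
    print("node", i, buf)  # every node ends with [1111, 2222, 3333, 4444]
```

Each node sends one chunk to the same neighbor in every one of the 2(n - 1) steps, which is why the induced flows are regular and persistent in the way the abstract describes.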

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • proof