Understanding the Throughput Bounds of Reconfigurable Datacenter Networks

Vamsi Addanki, Chen Avin, Stefan Schmid

TL;DR

The paper investigates throughput bounds for reconfigurable datacenter networks (RDCNs) under uniform-residual demand patterns and proves a separation: demand-aware RDCNs can achieve a throughput of at least $\frac{2}{3}$, which is higher than the oblivious baseline under these conditions. It introduces an integer-residual decomposition, which splits a demand matrix into a floor matrix and a residual matrix, to bound throughput. It also demonstrates that periodic fixed-duration reconfigurations can realize most of the gains with low overhead, while providing a rigorous $\frac{2}{3}$ to $\frac{4}{5}$ throughput range for static networks. Empirical validation via linear programming confirms substantial gains of demand-aware periodic designs over demand-oblivious schemes (e.g., up to $49\%$ in ML workloads and $2.4\times$ versus static) and shows a worst-case throughput near $0.8$ that is largely independent of degree. The work motivates new design directions at the intersection of theory and systems, and outlines open questions for broader demand matrices and practical, real-time switching adjustments.

Abstract

The increasing gap between the growth of datacenter traffic volume and the capacity of electrical switches led to the emergence of reconfigurable datacenter network designs based on optical circuit switching. A multitude of research works, ranging from demand-oblivious (e.g., RotorNet, Sirius) to demand-aware (e.g., Helios, ProjecToR) reconfigurable networks, demonstrate significant performance benefits. Unfortunately, little is formally known about the achievable throughput of such networks. Only recently have the throughput bounds of demand-oblivious networks been studied. In this paper, we tackle a fundamental question: Whether and to what extent can demand-aware reconfigurable networks improve the throughput of datacenters? This paper attempts to understand the landscape of the throughput bounds of reconfigurable datacenter networks. Given the rise of machine learning workloads and collective communication in modern datacenters, we specifically focus on their typical communication patterns, namely uniform-residual demand matrices. We formally establish a separation bound of demand-aware networks over demand-oblivious networks, proving analytically that the former can provide at least $16\%$ higher throughput. Our analysis further uncovers new design opportunities based on periodic, fixed-duration reconfigurations that can harness the throughput benefits of demand-aware networks while inheriting the simplicity and low reconfiguration overheads of demand-oblivious networks. Finally, our evaluations corroborate the theoretical results of this paper, demonstrating that demand-aware networks significantly outperform oblivious networks in terms of throughput. This work barely scratches the surface and unveils several intriguing open questions, which we discuss at the end of this paper.

Paper Structure (21 sections, 10 theorems, 3 equations, 7 figures)

Key Result

Theorem 1

A demand-aware reconfigurable network achieves a throughput of $1$ (full throughput) for every demand matrix whose normalized form (normalized by link capacity) is equal to the corresponding floor matrix.
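
The condition in Theorem 1 can be illustrated with a minimal sketch. It assumes the floor matrix is the entrywise floor of the normalized demand matrix and the residual matrix is whatever remains; this summary does not reproduce the paper's formal Definition 3, so that reading is an assumption. The function names `floor_residual_decomposition` and `equals_floor_matrix` and the `tol` parameter are hypothetical, introduced only for illustration.

```python
import math


def floor_residual_decomposition(demand, link_capacity=1.0, tol=1e-9):
    """Split a demand matrix into (floor, residual) after normalizing.

    Assumption: the floor matrix is the entrywise floor of demand / capacity,
    and the residual matrix is the fractional remainder of each entry.
    """
    floor_matrix, residual_matrix = [], []
    for row in demand:
        floor_row, residual_row = [], []
        for entry in row:
            normalized = entry / link_capacity
            # Snap values that are integers up to rounding error, so that
            # floating-point division does not create a spurious residual.
            nearest = round(normalized)
            if abs(normalized - nearest) <= tol:
                normalized = float(nearest)
            whole = math.floor(normalized)
            floor_row.append(whole)
            residual_row.append(normalized - whole)
        floor_matrix.append(floor_row)
        residual_matrix.append(residual_row)
    return floor_matrix, residual_matrix


def equals_floor_matrix(demand, link_capacity=1.0):
    """True when the normalized demand matrix has an all-zero residual."""
    _, residual = floor_residual_decomposition(demand, link_capacity)
    return all(r == 0.0 for row in residual for r in row)


if __name__ == "__main__":
    # Hypothetical 2x2 demands with link capacity 10 (illustrative values only).
    print(floor_residual_decomposition([[15, 5], [5, 15]], link_capacity=10))
    print(equals_floor_matrix([[20, 0], [0, 20]], link_capacity=10))
```

Under this reading, a matrix such as `[[20, 0], [0, 20]]` with capacity 10 has an all-zero residual and so falls under Theorem 1, whereas `[[15, 5], [5, 15]]` leaves a nonzero residual and does not.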

Figures (7)

  • Figure 1: The landscape of throughput bounds for reconfigurable datacenter networks under uniformly skewed communication patterns: while prior works show a tight bound close to $\frac{1}{2}$ for demand-oblivious networks, we show the first separation result, i.e., demand-aware networks are strictly better in terms of throughput. Even simple demand-aware networks based on periodic fixed-duration reconfigurations (similar to RotorNet and Sirius) can achieve at least $16\%$ better throughput in the worst case.
  • Figure 2: Physical topology of a reconfigurable datacenter network.
  • Figure 3: The demand matrices of emerging Machine Learning workloads, particularly DNN training workloads, exhibit excellent structure: (i) the decomposed floor and residual matrices are mostly regular (in terms of the sum of rows and columns); (ii) the floor matrices are typically close to a permutation matrix and carry the majority of the traffic, typically $>75\%$ in each row and column; (iii) the residual matrices carry very little traffic, typically $<25\%$ in each row and column. The color of each entry (cell) in the heatmaps (demand matrices) indicates the demand specified by the entry as the minimum percentage of the corresponding total source (row) demand and the corresponding total destination (column) demand.
  • Figure 4: The throughput of demand-aware periodic networks is strictly superior to demand-oblivious and static networks across all demand matrices. Demand-aware static performs poorly under synthetic demand matrices due to low degree, but it outperforms demand-oblivious for DNN training demand matrices (last four on the right). Worst-cases for each network are indicated by $\star$.
  • Figure 5: The worst-case throughput of demand-aware periodic is independent of degree and $30\%$ greater than that of demand-oblivious. The throughput of demand-aware static is dependent on degree but close to demand-oblivious even at low degree.
  • ...and 2 more figures

Theorems & Definitions (14)

  • Definition 1: Demand matrix
  • Definition 2: Throughput
  • Definition 3: Integer-residual decomposition
  • Definition 4: Uniform-residual demand matrix
  • Theorem 1: Throughput under integer demand matrices
  • Theorem 2: Ideal throughput of demand-aware RDCN
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 3: Lower bound for demand-aware static RDCNs
  • ...and 4 more