Understanding the Throughput Bounds of Reconfigurable Datacenter Networks
Vamsi Addanki, Chen Avin, Stefan Schmid
TL;DR
The paper studies throughput bounds for reconfigurable datacenter networks (RDCNs) under uniform-residual demand matrices and proves a separation: demand-aware RDCNs achieve a throughput of at least $\frac{2}{3}$, which exceeds that of the demand-oblivious baseline under the same demands. To bound throughput, it introduces a floor-residual decomposition that splits the demand into an integer part and a residual part. It shows that periodic, fixed-duration reconfigurations can realize most of the gains with low overhead, and it proves that the throughput of static networks lies in the range $\frac{2}{3}$ to $\frac{4}{5}$. Evaluations based on linear programming confirm substantial gains of demand-aware periodic designs over demand-oblivious schemes (e.g., up to $49\%$ on ML workloads and $2.4\times$ versus static) and show a worst-case throughput near $0.8$ that is largely independent of degree. The work motivates new design directions at the intersection of theory and systems, and outlines open questions on broader classes of demand matrices and on practical, real-time switching adjustments.
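To make the floor-residual idea concrete, the following minimal sketch splits a small demand matrix entry by entry into an integer part and a fractional remainder. It is an illustration under stated assumptions, not the paper's construction: the function names `floor_residual_split` and `is_uniform_residual` are hypothetical, demands are assumed to be expressed in units of link capacity, and "uniform-residual" is assumed here to mean that all off-diagonal residual entries are equal.

```python
import math

def floor_residual_split(demand):
    """Split each entry into its integer part and its fractional remainder."""
    floor_part = [[math.floor(x) for x in row] for row in demand]
    residual = [
        [x - f for x, f in zip(row, floor_row)]
        for row, floor_row in zip(demand, floor_part)
    ]
    return floor_part, residual

def is_uniform_residual(residual, tol=1e-9):
    """Assumed check: all off-diagonal residual entries are equal."""
    n = len(residual)
    off_diag = [residual[i][j] for i in range(n) for j in range(n) if i != j]
    return max(off_diag) - min(off_diag) <= tol

# Hypothetical 3-node demand matrix, in units of link capacity.
demand = [
    [0.0, 2.25, 1.25],
    [1.25, 0.0, 2.25],
    [2.25, 1.25, 0.0],
]
floor_part, residual = floor_residual_split(demand)
```

In this toy input the integer part differs from pair to pair, while every off-diagonal residual equals 0.25, so the matrix passes the assumed uniform-residual check.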
Abstract
The increasing gap between the growth of datacenter traffic volume and the capacity of electrical switches led to the emergence of reconfigurable datacenter network designs based on optical circuit switching. A multitude of research works, ranging from demand-oblivious (e.g., RotorNet, Sirius) to demand-aware (e.g., Helios, ProjecToR) reconfigurable networks, demonstrate significant performance benefits. Unfortunately, little is formally known about the achievable throughput of such networks. Only recently have the throughput bounds of demand-oblivious networks been studied. In this paper, we tackle a fundamental question: Whether and to what extent can demand-aware reconfigurable networks improve the throughput of datacenters? This paper attempts to understand the landscape of the throughput bounds of reconfigurable datacenter networks. Given the rise of machine learning workloads and collective communication in modern datacenters, we specifically focus on their typical communication patterns, namely uniform-residual demand matrices. We formally establish a separation bound of demand-aware networks over demand-oblivious networks, proving analytically that the former can provide at least $16\%$ higher throughput. Our analysis further uncovers new design opportunities based on periodic, fixed-duration reconfigurations that can harness the throughput benefits of demand-aware networks while inheriting the simplicity and low reconfiguration overheads of demand-oblivious networks. Finally, our evaluations corroborate the theoretical results of this paper, demonstrating that demand-aware networks significantly outperform oblivious networks in terms of throughput. This work barely scratches the surface and unveils several intriguing open questions, which we discuss at the end of this paper.
