When Light Bends to the Collective Will: A Theory and Vision for Adaptive Photonic Scale-up Domains
Vamsi Addanki
TL;DR
This paper addresses the collective-communication bottleneck in scale-up domains by integrating adaptive photonic interconnects into a principled framework. The core idea is to express each collective step as a matching, decompose the aggregate demand matrix with a Birkhoff--von Neumann (BvN) decomposition, and connect throughput to the maximum concurrent flow through the $\alpha$-$\beta$ model. This yields the step-level completion time $DCT(m_i\cdot\mathcal{M}_i)=\alpha+\delta\cdot\ell_i+\beta\cdot m_i/\theta(G,\mathcal{M}_i)$ and the total completion time of the collective. The paper then formulates a mixed-integer program that decides when to reconfigure, balancing the reconfiguration delay $\alpha_r$ against congestion and propagation costs, and demonstrates regime-dependent gains with a flow-level simulator. It also outlines a research agenda covering fast heuristics, simplified congestion metrics, routing in dynamic topologies, and practical photonic integration, pointing toward a future where light bends to the collective will.
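As a rough illustration of the step-level cost above, the following Python sketch evaluates $\alpha+\delta\cdot\ell_i+\beta\cdot m_i/\theta(G,\mathcal{M}_i)$ and compares a static topology against a reconfigured one. It rests on assumptions that are not stated in this summary: $\ell_i$ is read as a path length in hops and $\delta$ as a per-hop delay, and a reconfigured fabric is assumed to serve the matching over direct single-hop links with $\theta=1$. The function names are hypothetical, and the per-step comparison is a simplification; the paper itself decides reconfiguration through a mixed-integer program.

```python
def step_completion_time(alpha, delta, hops, beta, msg_size, theta):
    """Step cost: alpha + delta * hops + beta * msg_size / theta.

    Assumption: `hops` plays the role of l_i and `theta` is the maximum
    concurrent flow theta(G, M_i) of the matching on the current topology.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    return alpha + delta * hops + beta * msg_size / theta


def worth_reconfiguring(alpha, delta, beta, msg_size,
                        hops_static, theta_static, alpha_r):
    """Return True if paying alpha_r for direct links beats the static step.

    Assumption: after reconfiguration every pair in the matching has a
    direct link, so hops = 1 and theta = 1.
    """
    static = step_completion_time(alpha, delta, hops_static,
                                  beta, msg_size, theta_static)
    adaptive = alpha_r + step_completion_time(alpha, delta, 1,
                                              beta, msg_size, 1.0)
    return adaptive < static


# Illustrative numbers only (arbitrary units, not taken from the paper).
for m in (1.0, 8.0):
    print(m, worth_reconfiguring(alpha=1.0, delta=0.5, beta=2.0, msg_size=m,
                                 hops_static=4, theta_static=0.5, alpha_r=10.0))
```

With these made-up parameters the small message does not amortize the reconfiguration delay while the large one does, which is the regime dependence the summary refers to.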
Abstract
As chip-to-chip silicon photonics gains traction for its bandwidth and energy efficiency, collective communication has emerged as a critical bottleneck in scale-up systems. Programmable photonic interconnects offer a promising path forward: by dynamically reconfiguring the fabric, they can establish direct, high-bandwidth optical paths between communicating endpoints -- \emph{synchronously and guided by the structure of collective operations} (e.g., AllReduce). However, realizing this vision -- \emph{when light bends to the collective will} -- requires navigating a fundamental trade-off between reconfiguration delay and the performance gains of adaptive topologies. In this paper, we present a simple theoretical framework for adaptive photonic scale-up domains that makes this trade-off explicit and clarifies when reconfiguration is worthwhile. Along the way, we highlight a connection -- not surprising but still powerful -- between the Birkhoff--von Neumann (BvN) decomposition, maximum concurrent flow (a classic measure of network throughput), and the well-known $\alpha$-$\beta$ cost model for collectives. Finally, we outline a research agenda in algorithm design and systems integration that can build on this foundation.
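To make the BvN decomposition concrete, here is a minimal Python sketch that splits a demand matrix into weighted permutations (matchings). It is a textbook greedy construction, not the paper's procedure, and it assumes a square non-negative integer matrix whose row and column sums are all equal. The helper names `perfect_matching` and `bvn_decompose` are hypothetical.

```python
def perfect_matching(support):
    """Kuhn's augmenting-path matching on a boolean n x n support matrix.

    Returns perm with perm[i] = j for every row i, or None if no perfect
    matching exists.
    """
    n = len(support)
    match_col = [-1] * n  # match_col[j] = row currently assigned to column j

    def try_row(i, seen):
        for j in range(n):
            if support[i][j] and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            return None
    perm = [0] * n
    for j, i in enumerate(match_col):
        perm[i] = j
    return perm


def bvn_decompose(demand):
    """Greedy BvN: return a list of (weight, perm) terms summing to demand."""
    n = len(demand)
    rest = [row[:] for row in demand]  # residual demand, copied
    terms = []
    while any(any(row) for row in rest):
        perm = perfect_matching([[v > 0 for v in row] for row in rest])
        if perm is None:
            raise ValueError("row and column sums must all be equal")
        # Largest weight that keeps the residual non-negative.
        weight = min(rest[i][perm[i]] for i in range(n))
        for i in range(n):
            rest[i][perm[i]] -= weight
        terms.append((weight, perm))
    return terms


# Illustrative 3-endpoint demand (made-up values, every row/column sums to 3).
demand = [[0, 2, 1],
          [1, 0, 2],
          [2, 1, 0]]
for weight, perm in bvn_decompose(demand):
    print(weight, perm)
```

Each returned permutation can be read as one matching that a reconfigurable fabric could serve with direct links, with the weight indicating how much demand that matching carries.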
