When Light Bends to the Collective Will: A Theory and Vision for Adaptive Photonic Scale-up Domains

Vamsi Addanki

TL;DR

The paper addresses the bottleneck of collective communication in scale-up domains by placing adaptive photonic interconnects in a principled framework. The core idea is to express each collective step as a matching, decompose the aggregate demand matrix with a Birkhoff–von Neumann decomposition, and connect throughput to the maximum concurrent flow through the $\alpha$-$\beta$ model. This yields the step-level completion time $DCT(m_i\cdot\mathcal{M}_i)=\alpha+\delta\cdot\ell_i+\beta\cdot m_i/\theta(G,\mathcal{M}_i)$ and, from it, the total completion time. The paper then formulates a mixed-integer program that decides when to reconfigure, balancing the reconfiguration delay $\alpha_r$ against congestion and propagation, and demonstrates regime-dependent gains with a flow-level simulator. It also outlines a research agenda covering fast heuristics, simplified congestion metrics, routing in dynamic topologies, and practical photonic integration, pointing toward a future where light bends to the collective will.
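The step-level cost above can be evaluated directly. The following is a minimal sketch, not the paper's mixed-integer program: it assumes $\ell_i$ is a path length in hops, and that a reconfigured step pays $\alpha_r$ once and then runs over direct circuits with $\ell_i = 1$ and $\theta = 1$. The names `Step`, `step_time`, and `total_time` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    m: float      # m_i: message size sent in this step
    hops: int     # l_i: assumed here to be the path length in hops
    theta: float  # theta(G, M_i): max concurrent flow of matching M_i on topology G


def step_time(step, alpha, delta, beta):
    """DCT(m_i * M_i) = alpha + delta * l_i + beta * m_i / theta(G, M_i)."""
    if step.theta <= 0:
        raise ValueError("theta must be positive")
    return alpha + delta * step.hops + beta * step.m / step.theta


def total_time(steps, reconfigure, alpha, delta, beta, alpha_r):
    """Sum step times; reconfigure[i] says whether step i gets a new topology."""
    total = 0.0
    for step, reconf in zip(steps, reconfigure):
        if reconf:
            # Assumption: after reconfiguring, every pair in the matching has a
            # direct circuit, so the step runs with one hop and theta = 1.
            direct = Step(m=step.m, hops=1, theta=1.0)
            total += alpha_r + step_time(direct, alpha, delta, beta)
        else:
            total += step_time(step, alpha, delta, beta)
    return total


steps = [Step(m=8.0, hops=4, theta=0.5)]
static = total_time(steps, [False], alpha=1.0, delta=0.5, beta=0.25, alpha_r=2.0)
adaptive = total_time(steps, [True], alpha=1.0, delta=0.5, beta=0.25, alpha_r=2.0)
```

With these illustrative numbers the reconfigured step is cheaper; a larger `alpha_r` or a smaller message would flip the comparison, which is the trade-off the framework makes explicit.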

Abstract

As chip-to-chip silicon photonics gain traction for their bandwidth and energy efficiency, collective communication has emerged as a critical bottleneck in scale-up systems. Programmable photonic interconnects offer a promising path forward: by dynamically reconfiguring the fabric, they can establish direct, high-bandwidth optical paths between communicating endpoints, synchronously and guided by the structure of collective operations (e.g., AllReduce). However, realizing this vision, in which light bends to the collective will, requires navigating a fundamental trade-off between reconfiguration delay and the performance gains of adaptive topologies. In this paper, we present a simple theoretical framework for adaptive photonic scale-up domains that makes this trade-off explicit and clarifies when reconfiguration is worthwhile. Along the way, we highlight a connection (not surprising but still powerful) between the Birkhoff–von Neumann (BvN) decomposition, maximum concurrent flow (a classic measure of network throughput), and the well-known $\alpha$-$\beta$ cost model for collectives. Finally, we outline a research agenda in algorithm design and systems integration that can build on this foundation.
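To make the BvN decomposition concrete, here is a minimal sketch of the classic greedy procedure on an integer demand matrix whose row and column sums are all equal. It is an illustration under that assumption, not necessarily the procedure used in the paper; `perfect_matching` and `bvn_decompose` are hypothetical names. Each returned term is a weight and a permutation, that is, one matching that a reconfigurable fabric could realize as a set of direct circuits.

```python
def perfect_matching(support):
    """Kuhn's augmenting-path algorithm; returns perm with row r -> column perm[r]."""
    n = len(support)
    match_col = [-1] * n  # match_col[c] = row currently matched to column c

    def try_row(r, seen):
        for c in range(n):
            if support[r][c] and not seen[c]:
                seen[c] = True
                if match_col[c] == -1 or try_row(match_col[c], seen):
                    match_col[c] = r
                    return True
        return False

    for r in range(n):
        if not try_row(r, [False] * n):
            return None  # no perfect matching in the support
    perm = [0] * n
    for c, r in enumerate(match_col):
        perm[r] = c
    return perm


def bvn_decompose(demand):
    """Greedy BvN: peel off weighted permutations until the matrix is zero."""
    n = len(demand)
    residual = [row[:] for row in demand]
    terms = []
    while any(any(row) for row in residual):
        perm = perfect_matching([[x > 0 for x in row] for row in residual])
        if perm is None:
            raise ValueError("row and column sums must all be equal")
        weight = min(residual[r][perm[r]] for r in range(n))
        for r in range(n):
            residual[r][perm[r]] -= weight
        terms.append((weight, perm))
    return terms


demand = [[0, 2, 1],
          [1, 0, 2],
          [2, 1, 0]]
terms = bvn_decompose(demand)
```

Each iteration zeroes at least one entry of the residual, so the loop terminates after at most as many terms as the matrix has nonzero entries.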

Paper Structure

This paper contains 8 sections, 8 equations, 2 figures.

Figures (2)

  • Figure 1: Heatmaps showing the speedup in collective completion times achieved by our optimized schedules, compared to BvN-based schedules (top row) and a static ring topology (bottom row).
  • Figure 2: Our optimized schedules can significantly speed up collective communication, even compared to the best of both worlds: BvN schedules and a static ring topology.