Harvest: Adaptive Photonic Switching Schedules for Collective Communication in Scale-up Domains

Mahir Rahman, Samuel Joseph, Nihar Kodkani, Behnaz Arzani, Vamsi Addanki

TL;DR

Harvest tackles the problem of optimizing reconfigurable photonic interconnects for collective GPU communication by balancing reconfiguration delay $\alpha_r$ with congestion and propagation delays. It abstracts the schedule of a collective into step-wise communication patterns and uses a dynamic-programming framework combined with a topology-optimization subproblem to synthesize when and how to reconfigure. The key contributions include a formal model linking $\alpha$–$\beta$ cost, maximum concurrent flow, and reconfiguration costs; a general Harvest framework applicable to arbitrary collectives; and a polylogarithmic-time optimal schedule for Recursive Doubling AllReduce, plus extensive simulation and hardware-emulation validation showing substantial performance gains over static and per-step reconfiguration baselines. The work enables practical, offline synthesis of switching schedules that adapt to photonic technology parameters and demonstrates meaningful improvements in collective completion time with manageable synthesis overhead, offering a path toward adaptive photonic scale-up domains.

Abstract

As chip-to-chip silicon photonics gain traction for their bandwidth and energy efficiency, their circuit-switched nature raises a fundamental question for collective communication: when and how should the interconnect be reconfigured to realize these benefits? Establishing direct optical paths can reduce congestion and propagation delay, but each reconfiguration incurs non-negligible overhead, making naive per-step reconfiguration impractical. We present Harvest, a systematic approach for synthesizing topology reconfiguration schedules that minimize collective completion time in photonic interconnects. Given a collective communication algorithm and its fixed communication schedule, Harvest determines how the interconnect should evolve over the course of the collective, explicitly balancing reconfiguration delay against congestion and propagation delay. We reduce the synthesis problem into a dynamic program with an underlying topology optimization subproblem and show that the approach applies to arbitrary collective communication algorithms. Furthermore, we exploit the algorithmic structure of a well-known AllReduce algorithm (Recursive Doubling) to synthesize optimal reconfiguration schedules without using any optimizers. By parameterizing the formulation using reconfiguration delay, Harvest naturally adapts to various photonic technologies. Using packet-level and flow-level evaluations, as well as hardware emulation on commercial GPUs, we show that the schedules synthesized by Harvest significantly reduce collective completion time across multiple collective algorithms compared to static interconnects and reconfigure-every-step baselines.

Paper Structure

This paper contains 28 sections, 2 theorems, 18 equations, 12 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

For any starting step $a\in\{1,\dots,s\}$ and number of reconfigurations $k \ge 1$, the optimal completion time is given by a recurrence, where $t_c(\cdot,\cdot)$ is the completion time for steps $a$ through $b$, as defined by the paper's $t_c$ subproblem equation.
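
The recurrence itself is not reproduced on this page, so the following is only a minimal sketch of the kind of dynamic program the lemma describes, not the paper's formulation. It assumes a hypothetical cost oracle `t_c(a, b)` that returns the completion time of steps `a` through `b` on a single topology chosen for that segment, charges a fixed delay `alpha_r` per reconfiguration, and treats the initial topology as free. The names `min_completion_time`, `best`, and `toy_cost` are illustrative.

```python
from functools import lru_cache
from typing import Callable


def min_completion_time(
    s: int, k: int, alpha_r: float, t_c: Callable[[int, int], float]
) -> float:
    """Minimum time to run steps 1..s with at most k reconfigurations.

    Assumptions (not taken from the paper): t_c(a, b) is the cost of
    running steps a..b inclusive on one topology, each reconfiguration
    costs alpha_r, and the first topology is installed at no cost.
    """

    @lru_cache(maxsize=None)
    def best(a: int, r: int) -> float:
        # Best time for steps a..s with at most r reconfigurations left.
        if a > s:
            return 0.0
        # Option 1: keep one topology for all remaining steps.
        result = t_c(a, s)
        # Option 2: serve steps a..b, pay alpha_r, and recurse on b+1..s.
        if r > 0:
            for b in range(a, s):
                result = min(result, t_c(a, b) + alpha_r + best(b + 1, r - 1))
        return result

    return best(1, k)


def toy_cost(a: int, b: int) -> float:
    # Hypothetical cost: one topology serving more steps congests more.
    return float((b - a + 1) ** 2)
```

With `toy_cost` and four steps, a small `alpha_r` favors splitting the collective into several segments, while a large `alpha_r` makes the static topology (no reconfiguration) the best choice, which mirrors the trade-off between reconfiguration delay and congestion described in the abstract.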

Figures (12)

  • Figure 1: Collective communication primitives
  • Figure 2: Various configurations of topology and communication patterns during the Recursive Doubling AllReduce algorithm. Black lines indicate the physical topology, and red arrows indicate communication between nodes.
  • Figure 3: Reconfiguration-delay-aware circuit-switching schedules for Recursive Doubling reveal the full design spectrum between BvN and static topologies. The black curve denotes a lower bound on completion time, beyond which no reconfiguration schedule can achieve further improvement.
  • Figure 4: GPUs with on-chip optical I/O (with one or more transceivers) connect to a photonic interconnect that establishes direct optical paths between them.
  • Figure 5: [Simulations] Harvest reduces collective completion time compared to BvN-based schedules and static topologies, with link bandwidth = 800 Gbps, $\alpha = 500$ ns, and $\delta = 500$ ns.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Lemma 1: Recurrence
  • Theorem 1: Optimality of the schedule