Chip-to-chip photonic connectivity in multi-accelerator servers for ML

Abhishek Vijaya Kumar, Arjun Devraj, Darius Bunandar, Rachee Singh

TL;DR

To address the data-movement bottlenecks in multi-accelerator ML servers, the paper introduces Lumorph, a circuit-switched, chip-to-chip photonic interconnect built on the Lightpath platform. Lumorph targets multi-tenant resource slicing and optimized AllReduce through on-demand optical circuits and adapted collective algorithms that account for reconfiguration latency. Key results from a lab prototype and simulations show $74\%$ faster rack-scale collective communication and up to $1.7\times$ higher end-to-end ML training throughput, with the reconfiguration latency of $3.7\,\mu s$ factored into the $\alpha$-cost (the per-step latency term) of each collective. These results suggest that a photonic interconnect can improve resource utilization and scalability for AI workloads.
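To make the role of the $\alpha$-cost concrete, the sketch below uses the textbook $\alpha$-$\beta$ model of collective communication, in which each step costs a fixed latency $\alpha$ plus $\beta$ seconds per byte sent. It is an illustration under stated assumptions, not the paper's own cost model: the function names are hypothetical, and the only number taken from the summary is the $3.7\,\mu s$ reconfiguration latency. The sketch assumes, pessimistically, that every step of recursive halving/doubling AllReduce sets up a new circuit, while a ring AllReduce keeps a static topology and pays no per-step reconfiguration.

```python
import math

# Reconfiguration latency reported in the summary (3.7 microseconds).
RECONFIG_LATENCY_S = 3.7e-6


def ring_allreduce_time(n_bytes, p, alpha, beta):
    """Textbook ring AllReduce: 2(p-1) steps, each moving n/p bytes."""
    steps = 2 * (p - 1)
    return steps * alpha + steps * (n_bytes / p) * beta


def halving_doubling_allreduce_time(n_bytes, p, alpha, beta, reconfig=0.0):
    """Textbook recursive halving/doubling AllReduce: 2*log2(p) steps.

    Assumption (not from the source): every step needs a new circuit,
    so the reconfiguration latency is added to alpha on each step.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    steps = 2 * (p.bit_length() - 1)  # 2 * log2(p), exact for powers of two
    bandwidth_term = 2 * (p - 1) / p * n_bytes * beta
    return steps * (alpha + reconfig) + bandwidth_term


if __name__ == "__main__":
    # Hypothetical link parameters, chosen only for illustration.
    alpha, beta = 5e-6, 1e-10
    for n in (64 * 1024, 64 * 1024 * 1024):
        ring = ring_allreduce_time(n, 8, alpha, beta)
        hd = halving_doubling_allreduce_time(n, 8, alpha, beta, RECONFIG_LATENCY_S)
        print(f"n={n} bytes: ring={ring:.3e} s, halving/doubling={hd:.3e} s")
```

In this model both algorithms have the same bandwidth term, so the comparison reduces to the number of steps times the effective per-step latency. That is why a reconfiguration latency in the microsecond range has to be counted in $\alpha$ when choosing a collective algorithm for small messages.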

Abstract

We present a rack-scale compute architecture for ML using multi-accelerator servers connected via chip-to-chip silicon photonic components. Our architecture achieves (1) multi-tenanted resource slicing without fragmentation, (2) 74% faster rack-scale collective communication, and (3) 1.7X speedup in end-to-end ML training throughput.

Paper Structure

This paper contains 5 sections, 4 figures.

Figures (4)

  • Figure 1: Server-scale photonic fabric.
  • Figure 2: (a) Example allocations in a multi-tenanted environment with fragmented compute. (b) Snapshot of circuits set up for AllReduce in each tenant's allocation.
  • Figure 3: Example configuration with 8 GPUs using Lumorph that is equivalent to SiPAC(2,3) [karen_bcube].
  • Figure 4: Performance of Lumorph.