Short-circuiting Rings for Low-Latency AllReduce

Sarah-Michelle Hammer, Stefan Schmid, Rachee Singh, Vamsi Addanki

TL;DR

This work revisits the assumption that Ring AllReduce is suboptimal for small messages on ring-based GPU interconnects, showing that when propagation delays are properly modeled, Ring can outperform Recursive Doubling. It introduces an adaptive photonic interconnect framework with a simple circuit-switching heuristic that reconfigures per collective step to short-circuit long paths, balancing reconfiguration delay with propagation and congestion benefits. A formal threshold-based strategy decides when to reconfigure, yielding substantial speed-ups in latency-bound regimes (notably for very small messages) and illustrating the potential of in-collective reconfiguration for future photonic networks. The study highlights practical challenges like synchronization and control-plane design, and points to rich future research directions toward practical, scalable photonic switching for AllReduce in GPU clusters.

Abstract

Efficient collective communication is critical for many distributed ML and HPC applications. In this context, it is widely believed that the Ring algorithm for the AllReduce collective communication operation is optimal only for large messages, while Recursive Doubling is preferable for small ones due to its logarithmic number of steps compared to the linear number for Ring. In this paper, we challenge this long-held assumption and show that the Ring algorithm can remain optimal even for short messages in ring-based GPU-to-GPU topologies, once realistic propagation delays and link capacity constraints are accounted for. We find that the total propagation delay for both Ring and Recursive Doubling essentially sums to the same value, but the latter incurs significantly higher congestion due to longer hop counts, leading to increased completion times. This surprising result motivates our case for in-collective adaptive topologies, particularly in the context of emerging photonic interconnects, which can break through the limitations of static topology designs at the collective communication granularity. We design a simple and fast heuristic for circuit-switching that enables Recursive Doubling to exploit dynamically reconfigurable photonic paths, carefully balancing reconfiguration delays, propagation latencies, and link congestion to minimize overall completion time. Our preliminary evaluations, using realistic reconfiguration delays, show that our circuit-switching schedules enable faster completion times for Recursive Doubling, even compared to Ring AllReduce on static ring topologies. We conclude by highlighting key challenges and future research directions for realizing practical, in-collective photonic switching.
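The claim that both algorithms accumulate the same propagation delay while Recursive Doubling congests links more can be checked with a small counting exercise. The sketch below is not taken from the paper. It assumes a bidirectional ring with a power-of-two number of nodes, a uniform per-hop delay, shortest-path routing, and a single phase of the collective (for example reduce-scatter). The function names `recursive_doubling_steps` and `ring_steps` are hypothetical.

```python
def recursive_doubling_steps(n):
    """Return [(hop_distance, max_link_load), ...] for each step.

    Assumption: in the step with distance d, node i exchanges data with
    node i XOR d along the shortest path on a bidirectional ring.
    Only clockwise links are counted; the counter-clockwise direction
    is symmetric.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    steps = []
    dist = 1
    while dist < n:
        load = [0] * n  # load[j] = flows on clockwise link j -> (j + 1) % n
        for src in range(n):
            if (src & dist) == 0:  # lower partner sends clockwise to src + dist
                for hop in range(dist):
                    load[(src + hop) % n] += 1
        steps.append((dist, max(load)))
        dist *= 2
    return steps


def ring_steps(n):
    """Ring phase: n - 1 steps, each one hop to the clockwise neighbour,
    so every clockwise link carries exactly one flow per step."""
    return [(1, 1)] * (n - 1)


if __name__ == "__main__":
    n = 16
    rd = recursive_doubling_steps(n)
    ring = ring_steps(n)
    print("Recursive Doubling steps (distance, max load):", rd)
    print("Total hops, Recursive Doubling:", sum(d for d, _ in rd))
    print("Total hops, Ring:", sum(d for d, _ in ring))
```

Under these assumptions the hop distances 1, 2, 4, ..., n/2 sum to n - 1, the same number of hops as the n - 1 single-hop Ring steps, while the busiest link in the step with distance d carries d flows instead of one.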

Paper Structure

This paper contains 9 sections, 5 equations, 4 figures.

Figures (4)

  • Figure 1: Realistic network simulations with Astra-Sim show that Ring AllReduce clearly outperforms the Recursive Doubling algorithm even for small message sizes, especially when the propagation delay is low. The y-axis shows the completion time of Recursive Doubling relative to Ring AllReduce on a ring network topology.
  • Figure 2: Left: For $m=32$B, the best reconfiguration threshold, $T$, decreases with larger propagation delays and lower reconfiguration delays. Our strategy speeds up the completion of reduce-scatter by up to $474\%$ compared to the static Ring algorithm. Middle: For $m=4$MB, the best $T$ across all delay pairs is $T=1$, i.e., always reconfiguring. Yet the speed-up of our strategy over the static Ring is more limited than at smaller message sizes, reaching $58\%$. Right: For $m=32$MB, as with $4$MB, the best $T$ remains $T=1$, and the best speed-up of $8.1\%$ is achieved at $1000$ns propagation delay.
  • Figure 3: For small messages ($32$B), the best reconfiguration strategy for Recursive Doubling shifts towards early reconfiguration (small $T$ values) as reconfiguration delay decreases and propagation delay increases.
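The trade-off behind the reconfiguration threshold $T$ in the captions above can be illustrated with a toy cost model. This sketch is not the paper's formulation, and every name and formula in it is an assumption made for illustration. Steps are numbered from 1. A step before $T$ runs on the static ring and pays its hop distance in both propagation and congestion. A step at or after $T$ pays one reconfiguration delay and then uses a direct one-hop circuit. The message size is kept constant across steps.

```python
def step_distances(n):
    """Hop distances 1, 2, 4, ..., n/2 of Recursive Doubling on n nodes."""
    dists, d = [], 1
    while d < n:
        dists.append(d)
        d *= 2
    return dists


def completion_time(n, threshold, prop_ns, reconf_ns, msg_bytes, bytes_per_ns):
    """Toy completion time in ns (hypothetical model, not from the paper)."""
    transfer = msg_bytes / bytes_per_ns
    total = 0.0
    for step, dist in enumerate(step_distances(n), start=1):
        if step >= threshold:
            # Assumed: reconfigure, then send over a direct one-hop circuit.
            total += reconf_ns + prop_ns + transfer
        else:
            # Assumed: static ring, dist hops and dist flows sharing a link.
            total += dist * prop_ns + dist * transfer
    return total


def best_threshold(n, prop_ns, reconf_ns, msg_bytes, bytes_per_ns):
    """Smallest threshold with minimal toy completion time.

    The value len(step_distances(n)) + 1 means "never reconfigure".
    """
    candidates = range(1, len(step_distances(n)) + 2)
    return min(
        candidates,
        key=lambda t: completion_time(
            n, t, prop_ns, reconf_ns, msg_bytes, bytes_per_ns
        ),
    )


if __name__ == "__main__":
    for prop in (100, 1000):
        t = best_threshold(16, prop_ns=prop, reconf_ns=1500,
                           msg_bytes=32, bytes_per_ns=50)
        print(f"propagation {prop} ns -> best T = {t}")
```

In this toy model a larger propagation delay or a smaller reconfiguration delay pushes the best threshold towards earlier steps, which matches the qualitative trend the captions describe. The absolute numbers carry no meaning, since the parameter values are placeholders.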