OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud

Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, Muhammad Shahbaz

TL;DR

OptiReduce tackles the cloud tail-latency problem in distributed deep learning (DDL) by replacing deterministic AllReduce with tail-robust, bounded-time primitives. It combines three components: Transpose AllReduce (TAR) for direct peer-to-peer reduction, Unreliable Bounded Transport (UBT) with adaptive timeouts and dynamic incast, and the Hadamard Transform (HT) to disperse gradient loss. The design remains compatible with Gloo and PyTorch. In evaluations on a local cluster and on CloudLab, OptiReduce achieves, on average, about 70% faster time-to-accuracy than Gloo and 30% faster than NCCL in shared cloud settings, and it preserves convergence accuracy despite gradient drops. The work shows that exploiting DDL's resilience to lost gradients, together with the tail characteristics of cloud networks, yields practical improvements for large-scale training without changes to the provider's network. Accelerator-based reductions and transport offloads are left as future work.

Abstract

We present OptiReduce, a new collective-communication system for the cloud with bounded, predictable completion times for deep-learning jobs in the presence of varying computation (stragglers) and communication (congestion and gradient drops) variabilities. OptiReduce exploits the inherent resiliency and the stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients -- providing an efficient balance between (tail) performance and the resulting accuracy of the trained models. Exploiting this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve the DDL jobs' tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that OptiReduce achieves 70% and 30% faster time-to-accuracy (TTA), on average, when operating in shared, cloud environments (e.g., CloudLab) compared to Gloo and NCCL, respectively.
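
The abstract names the Hadamard Transform as one way to soften the effect of gradient drops. The sketch below illustrates the general idea only and is not the authors' implementation: it assumes a randomized variant (a random sign flip followed by an orthonormal Walsh-Hadamard transform), a toy 16-entry gradient, and a loss model in which the last four transmitted values never arrive. All function names (`fwht`, `encode`, `decode`, `drop_tail`) are hypothetical.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    The input length must be a power of two.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def encode(grad, signs):
    """Sender side: random sign flip, then Hadamard transform."""
    return fwht([g * s for g, s in zip(grad, signs)])


def decode(coded, signs):
    """Receiver side: inverse transform, then undo the sign flip."""
    return [v * s for v, s in zip(fwht(coded), signs)]


def drop_tail(vec, keep):
    """Assumed loss model: entries after the first `keep` are lost (zeroed)."""
    return vec[:keep] + [0.0] * (len(vec) - keep)


rng = random.Random(0)
n, keep = 16, 12
grad = [rng.gauss(0.0, 1.0) for _ in range(n)]       # toy gradient bucket
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # shared by sender and receiver

# Without the transform, the loss wipes out four specific gradient entries.
plain = drop_tail(grad, keep)
# With the transform, the same loss hits four coded values instead.
sent = encode(grad, signs)
recovered = decode(drop_tail(sent, keep), signs)

plain_err = [abs(g - p) for g, p in zip(grad, plain)]
coded_err = [abs(g - r) for g, r in zip(grad, recovered)]
```

In the untransformed case the error sits entirely on the dropped entries (`plain_err` is zero elsewhere). In the transformed case each lost coded value is a mix of every gradient entry, so the error energy, which equals the energy of the dropped coded values because the transform is orthonormal, is distributed across entries rather than erasing particular ones.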

Paper Structure (53 sections, 20 figures, 2 tables)

Figures (20)

  • Figure 1: A backpropagation pass in distributed data-parallel (DDP) training. Multiple gradient aggregation (GA) runs share a bucket ($B_i$) worth of gradient entries among worker nodes ($W_n$), in parallel. The two send(bcast)/receive stages (1, 3) in GA incur the most time---contributing to the tail latency and stalling workers.
  • Figure 2: Architectures for gradient aggregation: Parameter Server (PS) and AllReduce (AR).
  • Figure 3: The latency ECDF (in milliseconds), showing the tail-to-median ratio ($\bm{P_{99/50}}$) observed across leading AI cloud platforms.
  • Figure 4: The OptiReduce design: Transpose AllReduce with colocated parameter servers, Unreliable Bounded Transport, and Hadamard Transform.
  • Figure 5: A comparison of Ring versus OptiReduce.
  • ...and 15 more figures
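
The tail-to-median ratio $P_{99/50}$ in the Figure 3 caption is the 99th-percentile latency divided by the median. A minimal sketch of that calculation follows, assuming a nearest-rank percentile definition; the latency samples are made up for illustration and are not measurements from the paper, and the helper names are hypothetical.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile (0 < q <= 100) of a non-empty list."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def tail_to_median(samples):
    """P99/50: how many times slower the 99th percentile is than the median."""
    return percentile(samples, 99) / percentile(samples, 50)


# Synthetic latencies in milliseconds: mostly fast, with a small slow tail.
latencies_ms = [1.0] * 95 + [3.0] * 3 + [12.0] * 2
ratio = tail_to_median(latencies_ms)
```

A ratio close to 1 means the slowest requests look like typical ones; a large ratio means a few slow transfers dominate completion time, which is the regime the paper targets.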