OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, Muhammad Shahbaz
TL;DR
OptiReduce tackles tail latency in cloud-based distributed deep learning (DDL) by replacing deterministic AllReduce with tail-robust, bounded-time primitives. It combines Transpose AllReduce (TAR) for direct peer-to-peer reduction, Unreliable Bounded Transport (UBT) with adaptive timeouts and dynamic incast, and the Hadamard Transform (HT) to disperse the effect of gradient loss, while remaining compatible with Gloo and PyTorch. In evaluations on local and CloudLab clusters, OptiReduce improves time-to-accuracy by 70% over Gloo and 30% over NCCL, on average, in shared cloud settings, and it preserves convergence accuracy despite gradient drops. The work shows that exploiting DDL's resilience to lost gradients, together with the tail characteristics of cloud networks, yields practical, scalable improvements for large-scale training without changes to the provider's network. Accelerator-based reductions and transport offloads are left as future work.
Abstract
We present OptiReduce, a new collective-communication system for the cloud with bounded, predictable completion times for deep-learning jobs in the presence of computation (stragglers) and communication (congestion and gradient drops) variability. OptiReduce exploits the inherent resiliency and the stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients, striking an efficient balance between (tail) performance and the resulting accuracy of the trained models. Exploiting this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve the DDL jobs' tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that OptiReduce achieves 70% and 30% faster time-to-accuracy (TTA), on average, compared to Gloo and NCCL, respectively, when operating in shared cloud environments (e.g., CloudLab).
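To make the role of the Hadamard Transform concrete, the following is a minimal sketch of how such a transform can disperse the effect of lost gradient entries. It is an illustration under stated assumptions, not OptiReduce's implementation: the function names (`fwht`, `drop_tail`, `reconstruction_error`) are hypothetical, the transform is a plain orthonormal fast Walsh-Hadamard transform in pure Python, and packet loss is modeled simply as zeroing the trailing entries of the transmitted vector.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (it is its own inverse).

    The length of `values` must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def drop_tail(values, n_lost):
    """Assumed loss model: the last n_lost transmitted entries arrive as zeros."""
    if not 0 <= n_lost <= len(values):
        raise ValueError("n_lost must be between 0 and len(values)")
    keep = len(values) - n_lost
    return list(values[:keep]) + [0.0] * n_lost


def reconstruction_error(grad, n_lost, use_hadamard):
    """Per-coordinate error after losing the last n_lost transmitted entries."""
    sent = fwht(grad) if use_hadamard else list(grad)
    received = drop_tail(sent, n_lost)
    decoded = fwht(received) if use_hadamard else received
    return [d - g for d, g in zip(decoded, grad)]


if __name__ == "__main__":
    rng = random.Random(0)
    grad = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic gradient
    for use_hadamard in (False, True):
        err = reconstruction_error(grad, n_lost=128, use_hadamard=use_hadamard)
        hit = sum(1 for e in err if abs(e) > 1e-9)
        worst = max(abs(e) for e in err)
        energy = sum(e * e for e in err)
        print(f"hadamard={use_hadamard}: coordinates hit={hit}, "
              f"worst error={worst:.3f}, squared error={energy:.3f}")
```

Because the transform is orthonormal, the total squared error introduced by the loss equals the energy of the dropped entries in whichever domain they were sent. What changes is where the error lands: without the transform, the lost coordinates are wiped out entirely while the rest are exact, whereas with the transform the same loss is spread as a small perturbation across all coordinates of the decoded gradient.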
