Accelerating Distributed ML Training via Selective Synchronization
Sahil Tyagi, Martin Swany
TL;DR
This work tackles the communication bottleneck of bulk-synchronous parallel (BSP) distributed training by introducing SelSync, a semi-synchronous method that switches at each step between a local-SGD update and a synchronized update, depending on how significant the update is. SelSync detects significant updates by comparing the relative change in gradients against a threshold (delta), and it aggregates parameters rather than gradients, which keeps convergence close to BSP while reducing communication overhead. The design also includes SelDP, a data-partitioning scheme for IID data, and a data-injection technique for non-IID data. Across CNNs and transformers, on both IID and non-IID data, SelSync reaches accuracy comparable to or better than BSP while reducing training time by up to 14×.
Abstract
In bulk-synchronous parallel (BSP) distributed training, deep neural networks (DNNs) are launched on multiple workers concurrently, and the workers aggregate their local updates at every step. However, BSP does not scale out linearly because of the high communication cost of aggregation. To mitigate this overhead, alternatives such as Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that decides at each step, based on the significance of the updates, whether to incur communication by calling the aggregation op or to avoid it by applying the updates locally. We propose several optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same accuracy as BSP or better while reducing training time by up to 14$\times$.
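
To make the per-step decision concrete, the following is a minimal single-process sketch of selective synchronization, not the authors' implementation. It simulates several workers on a synthetic least-squares problem with NumPy. Several details are assumptions made for illustration: significance is measured as the relative change between consecutive squared gradient norms, a step synchronizes when any worker flags its update as significant, and the names `relative_change`, `selective_sync_train`, and `delta` are hypothetical. A real deployment would replace the in-memory average with a collective operation across workers.

```python
import numpy as np


def relative_change(curr_norm_sq, prev_norm_sq, eps=1e-12):
    """Relative change between consecutive squared gradient norms (assumed metric)."""
    return abs(curr_norm_sq - prev_norm_sq) / (prev_norm_sq + eps)


def selective_sync_train(num_workers=4, steps=200, lr=0.05, delta=0.3, seed=0):
    """Simulate semi-synchronous training; returns (replicas, sync count, true weights)."""
    rng = np.random.default_rng(seed)
    dim = 5
    w_true = rng.normal(size=dim)

    # Each simulated worker holds its own shard of a synthetic regression task.
    shards = []
    for _ in range(num_workers):
        X = rng.normal(size=(64, dim))
        y = X @ w_true + 0.01 * rng.normal(size=64)
        shards.append((X, y))

    params = [np.zeros(dim) for _ in range(num_workers)]
    prev_norm_sq = [None] * num_workers
    sync_steps = 0

    for _ in range(steps):
        flags = []
        for k, (X, y) in enumerate(shards):
            grad = 2.0 * X.T @ (X @ params[k] - y) / len(y)
            norm_sq = float(grad @ grad)
            # The first step has no history, so it is treated as significant.
            significant = (
                prev_norm_sq[k] is None
                or relative_change(norm_sq, prev_norm_sq[k]) >= delta
            )
            flags.append(significant)
            prev_norm_sq[k] = norm_sq
            params[k] = params[k] - lr * grad  # local SGD update

        # Assumed rule: synchronize if any worker saw a significant change.
        if any(flags):
            # Parameter aggregation: every replica takes the mean of all replicas.
            mean = np.mean(params, axis=0)
            params = [mean.copy() for _ in range(num_workers)]
            sync_steps += 1

    return params, sync_steps, w_true
```

In this sketch, `delta=0.0` makes every step synchronize, which mirrors BSP, while a very large `delta` leaves only the first step synchronized and otherwise behaves like pure local SGD. Intermediate values trade communication for replica divergence.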
