Table of Contents
Fetching ...

Accelerating Distributed ML Training via Selective Synchronization

Sahil Tyagi, Martin Swany

TL;DR

This work tackles the communication bottleneck of bulk-synchronous distributed training by introducing SelSync, a semi-synchronous method that dynamically blends local-SGD and synchronized updates based on gradient-change significance. By detecting crucial updates with a delta-based threshold on relative gradient change and favoring parameter aggregation over gradient aggregation, SelSync maintains BSP-like convergence while reducing communication overhead. The design includes IID-friendly SelDP partitioning and non-IID data handling via data-injection, achieving strong accuracy and speedups across CNNs and transformers on IID and non-IID data. Practically, SelSync delivers up to several-fold speedups with improved or comparable accuracy to BSP, offering a robust approach for scalable distributed ML deployment.

Abstract

In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.

Accelerating Distributed ML Training via Selective Synchronization

TL;DR

This work tackles the communication bottleneck of bulk-synchronous distributed training by introducing SelSync, a semi-synchronous method that dynamically blends local-SGD and synchronized updates based on gradient-change significance. By detecting crucial updates with a delta-based threshold on relative gradient change and favoring parameter aggregation over gradient aggregation, SelSync maintains BSP-like convergence while reducing communication overhead. The design includes IID-friendly SelDP partitioning and non-IID data handling via data-injection, achieving strong accuracy and speedups across CNNs and transformers on IID and non-IID data. Practically, SelSync delivers up to several-fold speedups with improved or comparable accuracy to BSP, offering a robust approach for scalable distributed ML deployment.

Abstract

In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14.
Paper Structure (20 sections, 4 equations, 12 figures, 1 table, 1 algorithm)

This paper contains 20 sections, 4 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: (a) Relative throughput from training on NVIDIA V100 GPUs connected over 5Gbps NIC. (b) Training on 10 V100s with 1 label per-worker for non-IID CIFAR10 on ResNet101 and 10 labels per-worker for non-IID CIFAR100 on VGG11.
  • Figure 2: Setting worker batch-size to $Nb$ in SSP ensures same amount of work is done at each iteration as in BSP. However, computation and memory requirements also increase with batch-size, as measured on a Tesla K80 GPU.
  • Figure 3: Gradient Kernel density estimates (or KDE on the Y-axis) over the epochs for ResNet101 layer layer4_1_conv1_weight and Transformer layer transformer_encoder_layers_0_norm1_weight. Gradients are volatile in the early epochs ((a) and (c)), but get smaller and saturate as gradient variance decreases over the course of training ((b) and (d)).
  • Figure 4: We compute the largest eigenvalue of second-order Hessian on every iteration along with first-order gradient variance. Changes in eigenvalue of the Hessian help detect critical periods, and it can be approximated with the variance of first-order gradients. The latter has a significantly lower computational overhead.
  • Figure 5: Correlation between relative gradient change and model convergence in BSP. A rise or decline in test accuracy or perplexity is accompanied by changes in the gradients as well. As convergence plateaus, so does $\triangle (g_{i})$.
  • ...and 7 more figures