Hybrid Approach to Parallel Stochastic Gradient Descent

Aakash Sudhirbhai Vora, Dhrumil Chetankumar Joshi, Aksh Kantibhai Patel

TL;DR

This paper tackles data-parallel SGD by addressing the trade-off between speed and convergence quality in distributed training. It introduces the Smooth Switch algorithm, a hybrid method that starts with asynchronous updates and gradually shifts to synchronous aggregation via a threshold control, aiming to achieve faster early progress with reliable later convergence. Experiments on MNIST and CIFAR-10 show the hybrid method attains higher accuracy and lower loss than purely synchronous or asynchronous approaches, even under varying batch sizes, step sizes, and simulated communication delays. The approach offers practical benefits for distributed training of CNNs, particularly in environments with heterogeneous workers and network latency, by balancing throughput and update reliability.

Abstract

Stochastic Gradient Descent (SGD) is used to train models on large datasets because it reduces training time. In addition, data parallelism is widely used to train neural networks efficiently across multiple worker nodes in parallel. Most systems implement data parallelism with either a synchronous or an asynchronous approach. However, both have drawbacks. We propose a third approach to data parallelism, a hybrid of the synchronous and asynchronous approaches that uses both to train the neural network. When the threshold function is selected appropriately to gradually shift all parameter aggregation from asynchronous to synchronous, we show that, in a given time period, our hybrid approach outperforms both the asynchronous and synchronous approaches.
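
The idea of a threshold that moves aggregation from asynchronous to synchronous can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's Smooth Switch algorithm: the threshold schedule `staircase_threshold`, the loop `hybrid_sgd`, the round-robin arrival order, and the toy quadratic objective are all hypothetical choices made for illustration. The threshold is read here as the number of worker gradients the server waits for before applying an averaged update, so 1 behaves like asynchronous SGD and `num_workers` behaves like synchronous SGD.

```python
def staircase_threshold(step, total_steps, num_workers):
    """Hypothetical schedule: number of gradients to collect before updating.

    Returns 1 at the start (asynchronous-like) and grows in equal stages
    to num_workers at the end (synchronous-like).
    """
    frac = step / total_steps
    return min(num_workers, 1 + int(frac * num_workers))


def hybrid_sgd(grad_fn, w, num_workers, total_steps, lr=0.1):
    """Toy hybrid loop: buffer gradients until the threshold is reached.

    Simplification: workers "arrive" in round-robin order and compute their
    gradient on the server's current parameters, so real network delay and
    most gradient staleness are not modeled.
    """
    pending = []   # gradients received since the last parameter update
    history = []   # threshold value used at each step
    for step in range(total_steps):
        worker = step % num_workers
        pending.append(grad_fn(w, worker))
        k = staircase_threshold(step, total_steps, num_workers)
        history.append(k)
        if len(pending) >= k:
            w -= lr * sum(pending) / len(pending)  # averaged update
            pending.clear()
    return w, history


# Toy objective: worker i holds the loss 0.5 * (w - targets[i]) ** 2,
# so the minimizer of the average loss is the mean of the targets (2.5).
targets = [1.0, 2.0, 3.0, 4.0]
w_final, history = hybrid_sgd(
    grad_fn=lambda w, i: w - targets[i],
    w=0.0,
    num_workers=len(targets),
    total_steps=200,
)
```

With these settings the threshold rises from 1 to 4 in four equal stages, so early updates are frequent but noisy and later updates average over every worker. Any other nondecreasing schedule could be substituted for `staircase_threshold`.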

Paper Structure

This paper contains 13 sections, 3 equations, 10 figures, 5 tables, and 1 algorithm.

Figures (10)

  • Figure 1: System Architecture
  • Figure 2: MNIST samples
  • Figure 3: CIFAR-10 samples
  • Figure 4: Testing accuracy, testing loss, and training loss on MNIST for step size 300 and batch sizes 32 (left) and 64 (right)
  • Figure 5: Testing accuracy, testing loss, and training loss on MNIST for step size 500 and batch sizes 32 (left) and 64 (right)
  • ...and 5 more figures