Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization
Aditya Devarakonda, Ramakrishnan Kannan
TL;DR
This work tackles the high communication cost of distributed SGD on distributed-memory clusters by introducing HybridSGD, a 2D parallel SGD method that combines FedAvg-like row partitioning with $s$-step SGD-like column partitioning. The authors derive computation and communication cost bounds for a 2D processor grid, implement the methods in C++ and MPI, and show empirically on a Cray EX supercomputing system that HybridSGD is faster than both FedAvg and $s$-step SGD while converging better than FedAvg at similar processor scales. Key contributions include the HybridSGD design, an analysis of its convergence, computation, communication, and memory trade-offs, and experiments on logistic regression tasks showing speedups of up to $5.3\times$ over $s$-step SGD and up to $121\times$ over FedAvg. The results suggest that HybridSGD offers a practical route to scalable distributed optimization: adjusting the processor-grid dimensions balances communication, computation, and convergence. Potential extensions include GPU-based and deep learning settings.
Abstract
Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters, where communication is more expensive than computation, the scalability and performance of these algorithms are limited by communication cost. This work generalizes prior work on 1D $s$-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) that attains a continuous performance trade-off between the two baseline algorithms. We present a theoretical analysis that shows the convergence, computation, communication, and memory trade-offs between $s$-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of $5.3\times$ over $s$-step SGD and up to $121\times$ over FedAvg when solving binary classification tasks with the convex logistic regression model on datasets obtained from the LIBSVM repository.
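To make the 2D processor grid concrete, the following is a minimal C++ sketch of how a flat process rank might be mapped to a block of the data matrix. It is an illustration under stated assumptions, not the authors' implementation: the names `Range`, `Block`, `block_range`, and `local_block`, the row-major rank ordering, and the contiguous near-equal block split are all hypothetical choices made here. The grid dimensions are written `p_r` (processors along the sample dimension) and `p_c` (processors along the feature dimension), and the data matrix is assumed to have `m` samples and `n` features.

```cpp
#include <cstddef>

// Half-open index range [begin, end). Hypothetical type for this sketch.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Rows (samples) and columns (features) owned by one process. Hypothetical type.
struct Block {
    Range rows;
    Range cols;
};

// Split n items into `parts` contiguous blocks whose sizes differ by at most
// one, and return the block with index `idx` (0 <= idx < parts).
// The first (n % parts) blocks receive one extra item.
Range block_range(std::size_t n, std::size_t parts, std::size_t idx) {
    const std::size_t base = n / parts;
    const std::size_t rem = n % parts;
    const std::size_t begin = idx * base + (idx < rem ? idx : rem);
    const std::size_t len = base + (idx < rem ? 1 : 0);
    return {begin, begin + len};
}

// Map a flat rank onto a p_r x p_c grid (row-major ordering is an assumption)
// and return the block of an m x n data matrix assigned to that rank.
Block local_block(std::size_t rank, std::size_t p_r, std::size_t p_c,
                  std::size_t m, std::size_t n) {
    const std::size_t grid_row = rank / p_c;  // position along the sample dimension
    const std::size_t grid_col = rank % p_c;  // position along the feature dimension
    return {block_range(m, p_r, grid_row), block_range(n, p_c, grid_col)};
}
```

In this layout, setting `p_c = 1` gives every process all features and a slice of the samples, which corresponds to the 1D row partitioning associated with FedAvg, while setting `p_r = 1` gives every process all samples and a slice of the features, which corresponds to the 1D column partitioning associated with $s$-step SGD. Intermediate grid shapes fall between these two extremes.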
