Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

Aditya Devarakonda; Ramakrishnan Kannan

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

Aditya Devarakonda, Ramakrishnan Kannan

TL;DR

This work tackles the high communication cost of distributed SGD on memory-rich clusters by introducing HybridSGD, a 2D parallel SGD that blends FedAvg-like row-partitioning with $s$-step SGD-like column-partitioning. The authors provide theoretical cost and communication bounds under a 2D processor grid, implement the methods in C++/MPI, and empirically demonstrate that HybridSGD achieves substantial speedups over both FedAvg and $s$-step SGD on Cray hardware, while preserving convergence characteristics. Key contributions include a concrete HybridSGD design, a comprehensive cost/convergence analysis, and extensive experiments showing up to $5.3\times$ speedup over $s$-step SGD and up to $121\times$ over FedAvg on logistic regression tasks. The results suggest HybridSGD offers a practical pathway to scalable distributed optimization by adjusting the processor-grid dimensions to balance communication, computation, and convergence, with potential extensions to GPU-based and deep learning contexts.

Abstract

Distributed-memory implementations of numerical optimization algorithm, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters where communication is more expensive than computation, the scalability and performance of these algorithms are limited by communication cost. This work generalizes prior work on 1D $s$-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) which attains a continuous performance trade off between the two baseline algorithms. We present theoretical analysis which show the convergence, computation, communication, and memory trade offs between $s$-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of $5.3\times$ over $s$-step SGD and speedups up to $121\times$ over FedAvg when used to solve binary classification tasks using the convex, logistic regression model on datasets obtained from the LIBSVM repository.

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

TL;DR

This work tackles the high communication cost of distributed SGD on memory-rich clusters by introducing HybridSGD, a 2D parallel SGD that blends FedAvg-like row-partitioning with

-step SGD-like column-partitioning. The authors provide theoretical cost and communication bounds under a 2D processor grid, implement the methods in C++/MPI, and empirically demonstrate that HybridSGD achieves substantial speedups over both FedAvg and

-step SGD on Cray hardware, while preserving convergence characteristics. Key contributions include a concrete HybridSGD design, a comprehensive cost/convergence analysis, and extensive experiments showing up to

speedup over

-step SGD and up to

over FedAvg on logistic regression tasks. The results suggest HybridSGD offers a practical pathway to scalable distributed optimization by adjusting the processor-grid dimensions to balance communication, computation, and convergence, with potential extensions to GPU-based and deep learning contexts.

Abstract

-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) which attains a continuous performance trade off between the two baseline algorithms. We present theoretical analysis which show the convergence, computation, communication, and memory trade offs between

-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of

over

-step SGD and speedups up to

over FedAvg when used to solve binary classification tasks using the convex, logistic regression model on datasets obtained from the LIBSVM repository.

Paper Structure (21 sections, 6 theorems, 5 equations, 9 figures, 3 tables, 3 algorithms)

This paper contains 21 sections, 6 theorems, 5 equations, 9 figures, 3 tables, 3 algorithms.

Introduction
Background
s-step methods
Communication-efficient methods
Notation
Optimization Problem
Parallel Algorithms Design
Communication-efficient SGD
HybridSGD Design.
Algorithms Analysis
Computation, Convergence, and Storage
Communication
Convergence
Experiments
Convergence vs. Runtime
...and 6 more sections

Key Result

Theorem 5.1.1

$K$ iterations of SGD with $\bm{A}$ distributed across a 2D processor grid (2D SGD) of size $p = p_r \times p_c$ processor must perform $F = O(K \cdot (\frac{bc}{p} + n))$ flops and store $M = O(cm/p + n)$ words in memory per iteration.

Figures (9)

Figure 1: Comparison of convergence behavior of Gradient Descent (GD) and Stochastic Gradient Descent (SGD) for fixed batch size ($b = 16$) and learning rate ($\eta = 1$) on the w1a and breast-cancer datasets. The x-axis was normalized such that every iteration of GD corresponds to $m/b$ iterations of SGD.
Figure 2: Various matrix partitioning strategies explored in this work. We assume that there is only enough memory to store a single copy of $\bm{A}$. While this illustrates shows partitioning strategies on dense matrices, the idea can be generalized to sparse matrices.
Figure 3: Convergence behavior of $s$-step SGD for a fixed batch size ($b = 16$), fixed learning rate ($\eta = 1$), and varying values of $s$ on the w1a and breast-cancer datasets. The maximum entry-wise error in the $s$-step solution was on the order of $10^{-14}$ (\ref{['fig:w1a-sstep']}) and $10^{-15}$ (\ref{['fig:breast-cancer-sstep']}) when compared to the SGD solution.
Figure 4: Convergence behavior of FedAvg for a fixed batch size ($b$) and aggregation delay ($\tau$) on the w1a and breast-cancer dataset for various values of $p_r$. For FedAvg $p_r = p$. Learning rate ($\eta$) was tuned to ensure convergence.
Figure 5: Comparison of convergence behavior and accuracy of $s$-step SGD and HybridSGD at small batch sizes ($b = 4$) after offline tuning of $s$ on the url dataset (sparse). We show convergence in terms of the number of gradient evaluations (\ref{['fig:url-sstep-gradevals']}) and the running time (\ref{['fig:url-sstep-perf']}).
...and 4 more figures

Theorems & Definitions (12)

Theorem 5.1.1
proof
Theorem 5.1.2
proof
Theorem 5.1.3
proof
Theorem 5.2.1
proof
Theorem 5.2.2
proof
...and 2 more

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

TL;DR

Abstract

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (9)

Theorems & Definitions (12)