Adaptive Consensus Gradients Aggregation for Scaled Distributed Training

Yoni Choukroun, Shlomi Azoulay, Pavel Kisilev

TL;DR

This work formulates distributed gradient aggregation as an objective-aware subspace optimization problem and derives an efficient gradient weighting scheme guided by subspace coefficients. The scheme improves on the ubiquitous gradient averaging on multiple MLPerf tasks while remaining extremely efficient in both communication and computational complexity.

Abstract

Distributed machine learning has recently become a critical paradigm for training large models on vast datasets. We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints. While averaging distributed gradients is the most widely used method for gradient estimation, whether this is the optimal strategy remains an open question. In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization. By formulating the aggregation problem as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients. We further introduce subspace momentum to accelerate convergence while maintaining statistical unbiasedness in the aggregation. Our method demonstrates improved performance over the ubiquitous gradient averaging on multiple MLPerf tasks while remaining extremely efficient in both communicational and computational complexity.
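
The abstract does not state the weighting formula, so the following is only a minimal sketch of what replacing the plain average with a weighted combination of worker gradients might look like. It assumes a hypothetical rule in which each worker's coefficient is proportional to the inner product of its gradient with the mean gradient, normalized so the coefficients sum to one. The function names (`consensus_weights`, `aggregate`) and the rule itself are illustrative assumptions, not the authors' AdaCons implementation, and subspace momentum is omitted.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_gradient(grads):
    """Plain gradient averaging: the baseline the abstract compares against."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]


def consensus_weights(grads, eps=1e-12):
    """ASSUMED rule (not taken from the paper): weight each worker by the
    alignment of its gradient with the mean gradient, normalized to sum to 1."""
    g_bar = mean_gradient(grads)
    scores = [dot(g, g_bar) for g in grads]
    total = sum(scores)  # equals n * ||g_bar||^2, so it is never negative
    if total < eps:
        # Mean gradient is (numerically) zero: fall back to uniform weights.
        return [1.0 / len(grads)] * len(grads)
    return [s / total for s in scores]


def aggregate(grads, weights):
    """Weighted combination of the worker gradients."""
    dim = len(grads[0])
    return [sum(w * g[k] for w, g in zip(weights, grads)) for k in range(dim)]


# Three hypothetical workers; two agree, one points elsewhere.
worker_grads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = consensus_weights(worker_grads)
weighted = aggregate(worker_grads, weights)
averaged = mean_gradient(worker_grads)
```

Under this assumed rule, identical worker gradients reduce to plain averaging, while a worker whose gradient opposes the consensus direction receives a negative coefficient. In a real synchronous setup the inner products are scalars, so exchanging them would add only a small communication cost on top of the gradient all-reduce.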

Paper Structure

This paper contains 23 sections, 14 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: PyTorch (Paszke et al., 2019) implementation of the AdaCons DDP communication hook.
  • Figure 2: Performance of the aggregation schemes on the stochastic linear regression tasks for various numbers of workers and effective batch sizes. Additional experiments are given in Appendix \ref{appendix:linear}.
  • Figure 3: Performance of the aggregation schemes on the MLPerf ImageNet classification task for various numbers of workers.
  • Figure 4: Performance of the aggregation schemes on the MLPerf RetinaNet object detection task for various numbers of workers.
  • Figure 5: Performance of the aggregation schemes on the MLPerf DLRM task for various batch sizes. Additional results and visualizations are given in Appendix \ref{appendix:dlrm}.
  • ...and 6 more figures