A Stochastic Approximation Approach for Efficient Decentralized Optimization on Random Networks

Chung-Yiu Yau, Haoming Liu, Hoi-To Wai

TL;DR

This paper tackles decentralized optimization over time-varying random networks with unreliable, bandwidth-constrained communication. It introduces the Fully Stochastic Primal Dual Algorithm (FSPDA), which uses a stochastic augmented Lagrangian to incorporate topology randomness and seeks its saddle points via stochastic approximation. The authors develop two variants, FSPDA-SA and FSPDA-STORM, which achieve rates of $\mathcal{O}(1/\sqrt{T})$ and $\mathcal{O}(1/T^{2/3})$, respectively, for smooth (possibly non-convex) objectives, with linear convergence under the PL condition; both support sparsified communication and asynchronous operation. Empirical results on MNIST and ImageNet show improved iteration and communication efficiency over baselines, supporting the framework’s robustness to topology randomness and its usefulness for large-scale distributed learning.
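To make the saddle-point idea concrete, here is a minimal sketch. It is not the authors' FSPDA recursion but a generic stochastic primal-dual iteration on a ring whose edges are independently active with probability `p_edge`. The quadratic local objectives, the node-level dual variable `lam`, the step sizes, and all function names are assumptions chosen for illustration.

```python
import random

def ring_edges(n):
    """Edges of a ring on n nodes (hypothetical helper)."""
    return [(i, (i + 1) % n) for i in range(n)]

def laplacian_apply(x, edges, active):
    """Return L(xi) x, where L(xi) is the Laplacian of the active subgraph."""
    out = [[0.0] * len(row) for row in x]
    for (i, j), on in zip(edges, active):
        if not on:
            continue
        for k in range(len(x[i])):
            diff = x[i][k] - x[j][k]
            out[i][k] += diff
            out[j][k] -= diff
    return out

def primal_dual_sketch(b, steps=3000, alpha=0.05, beta=0.5, gamma=1.0,
                       p_edge=0.5, noise=0.1, seed=0):
    """Agent i holds f_i(x) = 0.5 * ||x - b[i]||^2 (an assumed toy objective).

    Primal step: descend on noisy gradient + dual + consensus penalty.
    Dual step:   ascend on the consensus violation seen on active edges.
    """
    rng = random.Random(seed)
    n, d = len(b), len(b[0])
    edges = ring_edges(n)
    x = [[0.0] * d for _ in range(n)]
    lam = [[0.0] * d for _ in range(n)]
    for _ in range(steps):
        # Sample the random topology for this iteration.
        active = [rng.random() < p_edge for _ in edges]
        lx = laplacian_apply(x, edges, active)
        for i in range(n):
            for k in range(d):
                grad = (x[i][k] - b[i][k]) + noise * rng.gauss(0.0, 1.0)
                step = grad + lam[i][k] + gamma * lx[i][k]
                lam[i][k] += beta * lx[i][k]   # dual ascent
                x[i][k] -= alpha * step        # primal descent
    return x
```

On this toy problem every agent should end near the average of the targets in `b`. The consensus penalty vanishes at any consensus point regardless of which edges are active, so in this sketch the edge randomness adds no noise at the solution.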

Abstract

A challenging problem in decentralized optimization is to develop algorithms that converge quickly on random, time-varying topologies under unreliable and bandwidth-constrained communication networks. This paper studies a stochastic approximation approach built on the Fully Stochastic Primal Dual Algorithm (FSPDA) framework. The framework relies on a novel observation: randomness in a time-varying topology can be incorporated into a stochastic augmented Lagrangian formulation whose expected value admits saddle points that coincide with stationary solutions of the decentralized optimization problem. Within the FSPDA framework, we develop two new algorithms that support efficient sparsified communication on random time-varying topologies: FSPDA-SA lets agents execute multiple local gradient steps, depending on the time-varying topology, to accelerate convergence, and FSPDA-STORM further incorporates a variance reduction step to improve sample complexity. For problems with a smooth (possibly non-convex) objective function, we show that within $T$ iterations FSPDA-SA (resp. FSPDA-STORM) finds an $\mathcal{O}(1/\sqrt{T})$-stationary (resp. $\mathcal{O}(1/T^{2/3})$-stationary) solution. Numerical experiments show the benefits of the FSPDA algorithms.
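The variance reduction step can be illustrated with the recursive-momentum (STORM) gradient estimator in its standard single-agent form. This sketch is an assumption-laden toy: the objective $f(x) = \tfrac{1}{2}x^2$, the additive noise model, the parameter names `eta` and `a`, and the function `storm_sketch` are all hypothetical, and the way FSPDA-STORM combines the estimator with its primal-dual updates is not shown in this summary.

```python
import random

def storm_sketch(x0=5.0, steps=2000, eta=0.05, a=0.1, sigma=0.1, seed=0):
    """Run SGD with a STORM-style estimator on f(x) = 0.5 * x**2.

    Returns the final iterate and the mean squared error of the
    gradient estimator d against the true gradient (which equals x here).
    """
    rng = random.Random(seed)

    def stoch_grad(x, xi):
        # Noisy gradient of the toy objective; xi is the random sample.
        return x + xi

    x_prev = x0
    d = stoch_grad(x_prev, sigma * rng.gauss(0.0, 1.0))  # plain estimate at t = 0
    x = x_prev - eta * d
    sq_err = 0.0
    for _ in range(steps):
        xi = sigma * rng.gauss(0.0, 1.0)
        # Key point: the SAME sample xi is evaluated at the current and the
        # previous iterate, so the correction term cancels most of the noise.
        d = stoch_grad(x, xi) + (1.0 - a) * (d - stoch_grad(x_prev, xi))
        sq_err += (d - x) ** 2
        x_prev, x = x, x - eta * d
    return x, sq_err / steps
```

With additive noise of variance `sigma**2`, the estimator error in this toy follows $e_t = (1-a)\,e_{t-1} + a\,\xi_t$, so its stationary variance is $a\,\sigma^2/(2-a)$, well below the variance $\sigma^2$ of a single stochastic gradient when `a` is small.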

Paper Structure

This paper contains 34 sections, 18 theorems, 122 equations, 9 figures, 7 tables, 3 algorithms.

Key Result

Theorem 3.5

Suppose that Assumptions assm:lip, assm:rand-graph, assm:f_var, and assm:graph_var hold and that the step sizes satisfy the conditions defined in eq:ss_cond. Then, for any $T \geq 1$ with the random stopping iteration ${\sf T} \sim {\rm Unif}\{0,\ldots,T-1\}$, the iterates generated by FSPDA-SA satisfy for any ${\tt a}>0$, where $F_0$, ${\mathbb{C}}_{\sigma}$ are defined in eq:ft_def_restated, eq:bbC_bound.
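The random stopping iteration in the theorem is a standard device in non-convex analysis. As a general identity, stated here for a generic integrable sequence $e_t$ (a symbol introduced for this note, not the paper's notation), if ${\sf T}$ is drawn uniformly from $\{0,\ldots,T-1\}$ independently of the iterates, then the expectation at the stopping index equals the running average:

```latex
\mathbb{E}\big[ e_{\sf T} \big]
  = \sum_{t=0}^{T-1} \Pr({\sf T} = t)\, \mathbb{E}[e_t]
  = \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[e_t].
```

A bound on the averaged quantity over $T$ iterations therefore carries over directly to the randomly selected iterate.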

Figures (9)

  • Figure 1: Feed-forward neural network classification training on MNIST using $10^6$ iterations.
  • Figure 2: ResNet-50 classification training on ImageNet.
  • Figure 3: Illustration of a (time-varying) random graph ${\cal G}(\xi)$ for primal variable of dimension $d=3$ on a ring network of $n = 5$ nodes. Solid lines represent active edges while dashed lines represent disconnected edges. In this example, node 2 is considered as idle in an asynchronous environment. ${\bf C}_{15}(\xi)$ is a diagonal matrix such that ${\rm diag}({\bf C}_{15}(\xi)) = (1, 0, 1)$.
  • Figure 4: Feed-forward neural network classification training on MNIST with two levels of data heterogeneity.
  • Figure 5: Feed-forward neural network classification training on shuffled MNIST. At each iteration, a random graph with $k$ edges in expectation ($k \in \{1, 10, 45\}$) is drawn from a complete topology.
  • ...and 4 more figures
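The random graph described in the Figure 3 caption can be sketched as follows: each ring edge is either active or disconnected, and each active edge carries a 0/1 diagonal mask that selects which coordinates are exchanged. The Bernoulli edge model, the uniform choice of `k` coordinates, the 0-based node indices, and all function names are assumptions for illustration; the caption does not specify how the graph is sampled.

```python
import random

def sample_sparse_graph(n, d, p_edge, k, rng):
    """Sample active ring edges and one coordinate mask per active edge.

    Returns {(i, j): mask}, where mask plays the role of diag(C_ij(xi)).
    """
    masks = {}
    for i in range(n):
        j = (i + 1) % n
        if rng.random() < p_edge:           # edge (i, j) is active
            keep = set(rng.sample(range(d), k))
            masks[(i, j)] = [1 if c in keep else 0 for c in range(d)]
    return masks

def idle_nodes(n, masks):
    """Nodes with no active edge in this round (idle in the asynchronous sense)."""
    busy = {node for edge in masks for node in edge}
    return [i for i in range(n) if i not in busy]

def masked_difference(x_i, x_j, mask):
    """Compute C_ij(xi) (x_i - x_j): only the selected coordinates are sent."""
    return [m * (a - b) for m, a, b in zip(mask, x_i, x_j)]
```

For instance, with the mask `[1, 0, 1]` from the caption, `masked_difference([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [1, 0, 1])` keeps the first and third coordinates and zeroes the second, so only two of the three entries need to be communicated over that edge.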

Theorems & Definitions (18)

  • Theorem 3.5
  • Corollary 3.7
  • Theorem 3.9
  • Lemma C.1
  • Lemma C.2
  • Lemma C.3
  • Lemma C.4
  • Theorem C.5
  • Lemma C.6
  • Lemma D.1
  • ...and 8 more