Accelerated Distributed Stochastic Non-Convex Optimization over Time-Varying Directed Networks

Yiyue Chen, Abolfazl Hashemi, Haris Vikalo

TL;DR

This work proposes an algorithm that leverages stochastic gradient descent with momentum and gradient tracking to solve distributed nonconvex optimization problems over time-varying networks and tackles the challenges that arise when analyzing dynamic network systems that communicate gradient acceleration components.

Abstract

Distributed stochastic non-convex optimization problems have recently received attention due to the growing interest of signal processing, computer vision, and natural language processing communities in applications deployed over distributed learning systems (e.g., federated learning). We study the setting where the data is distributed across the nodes of a time-varying directed network, a topology suitable for modeling dynamic networks experiencing communication delays and straggler effects. The network nodes, which can access only their local objectives and query a stochastic first-order oracle to obtain gradient estimates, collaborate to minimize a global objective function by exchanging messages with their neighbors. We propose an algorithm, novel to this setting, that leverages stochastic gradient descent with momentum and gradient tracking to solve distributed non-convex optimization problems over time-varying networks. To analyze the algorithm, we tackle the challenges that arise when analyzing dynamic network systems which communicate gradient acceleration components. We prove that the algorithm's oracle complexity is $\mathcal{O}(1/\epsilon^{1.5})$, and that under the Polyak-Łojasiewicz (PL) condition the algorithm converges linearly to a steady error state. The proposed scheme is tested on several learning tasks: a non-convex logistic regression experiment on the MNIST dataset, an image classification task on the CIFAR-10 dataset, and an NLP classification test on the IMDB dataset. We further present numerical simulations with an objective that satisfies the PL condition. The results demonstrate superior performance of the proposed framework compared to the existing related methods.
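To make the ingredients named in the abstract concrete, the following is a minimal sketch that combines push-sum mixing over time-varying directed graphs, a momentum-averaged stochastic gradient, and gradient tracking. It is not the paper's Push-ASGD update, which is not reproduced on this page. The update rule, the toy local objectives $f_i(x) = \frac{1}{2}\|x - b_i\|^2$ (convex, chosen only so the minimizer is known), the graph generator, and all names and parameter values are assumptions made for illustration.

```python
import numpy as np

def random_column_stochastic(n, rng):
    """Hypothetical graph generator: a random directed ring with self-loops plus
    random extra links. Entry [j, i] > 0 means node i sends to node j."""
    perm = rng.permutation(n)
    A = np.eye(n)                                  # self-loops
    for k in range(n):
        A[perm[(k + 1) % n], perm[k]] = 1.0        # ring keeps the graph strongly connected
    A[rng.random((n, n)) < 0.3] = 1.0              # extra random directed links
    return A / A.sum(axis=0, keepdims=True)        # normalize so each column sums to 1

def push_sum_momentum_tracking(b, alpha=0.02, beta=0.8, steps=3000,
                               noise_std=0.0, seed=0):
    """Illustrative recursion (an assumption, not the authors' algorithm)."""
    rng = np.random.default_rng(seed)
    n, d = b.shape

    def stoch_grad(x):
        # Gradient of 0.5 * ||x_i - b_i||^2 at every node, plus optional oracle noise.
        return (x - b) + noise_std * rng.standard_normal((n, d))

    u = np.zeros((n, d))          # push-sum numerators
    v = np.ones(n)                # push-sum weights
    x = u / v[:, None]            # de-biased local models
    m = stoch_grad(x)             # momentum-averaged gradient estimates
    y = m.copy()                  # gradient trackers
    for _ in range(steps):
        C = random_column_stochastic(n, rng)       # new directed graph every round
        u = C @ (u - alpha * y)                    # descend along the tracker, then mix
        v = C @ v
        x = u / v[:, None]
        m_new = beta * m + (1.0 - beta) * stoch_grad(x)
        y = C @ y + m_new - m                      # track the network-wide momentum sum
        m = m_new
    return x

b = np.random.default_rng(1).normal(size=(5, 3))   # toy local targets b_i
x = push_sum_momentum_tracking(b)
max_err = np.abs(x - b.mean(axis=0)).max()         # distance to the global minimizer
print(max_err)
```

Because each mixing matrix is column-stochastic, the sum of the trackers `y` always equals the sum of the momentum terms `m`, which is the property gradient tracking relies on; dividing by the push-sum weights `v` corrects the bias that directed, non-doubly-stochastic mixing would otherwise introduce. Setting `noise_std` above zero mimics a stochastic first-order oracle, in which case the iterates settle in a neighborhood of the minimizer rather than at it.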

Paper Structure

This paper contains 22 sections, 11 theorems, 83 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Suppose Assumptions 1-3 hold. Let the step size $\alpha$ satisfy […]. Moreover, let the momentum step size $\beta$ be such that […]. Then it holds that […], where $\delta = \max_t \sqrt{1 - \frac{\min_i([\phi_{t+1}]_i)}{\max_i([\phi_t]_i) (n-1)^2 n^{2(n+2)}} } \in (0, 1)$ is the network contraction parameter; $\phi_m = d/\min_{t, i} [\phi_t]_i$ is proportional to the inverse of the smalles
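The expression for the contraction parameter $\delta$ quoted in the theorem can be evaluated directly to get a feel for its size. The sketch below does so for a short sequence of vectors $\phi_t$; since this excerpt does not define $\phi_t$, the code assumes only that each is a positive vector of length $n$, and the numbers used are made-up toy values. The function name is hypothetical.

```python
import math

def contraction_parameter(phis):
    """Evaluate delta = max_t sqrt(1 - min_i(phi_{t+1}) / (max_i(phi_t) (n-1)^2 n^(2(n+2))))
    for a list of positive vectors phi_0, phi_1, ... (toy inputs, assumed)."""
    n = len(phis[0])
    scale = (n - 1) ** 2 * n ** (2 * (n + 2))      # exact integer arithmetic
    deltas = []
    for phi_t, phi_next in zip(phis, phis[1:]):
        ratio = min(phi_next) / (max(phi_t) * scale)
        deltas.append(math.sqrt(1.0 - ratio))
    return max(deltas)

# Toy sequence for n = 4 nodes (values are illustrative only).
phis = [[1.0, 1.0, 1.0, 1.0],
        [0.8, 1.2, 1.1, 0.9],
        [1.0, 0.7, 1.3, 1.0]]
delta = contraction_parameter(phis)
print(delta, 1.0 - delta)
```

The factor $n^{2(n+2)}$ makes this worst-case bound extremely close to 1 even for small networks. For roughly seven or more nodes with entries of comparable size, the ratio drops below double-precision resolution and the computed value rounds to exactly 1.0, so a higher-precision evaluation would be needed there.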

Figures (4)

  • Figure 1: Performance on MNIST. Push-ASGD achieves lower loss and higher correct rate than the competing schemes.
  • Figure 2: Performance on CIFAR-10. Push-ASGD achieves lower loss and higher correct rate than the competing schemes.
  • Figure 3: Performance on the natural language processing task. Push-ASGD achieves lower loss and higher accuracy than the competing schemes.
  • Figure 4: In simulations of a setting where the PL condition holds, Push-ASGD converges faster than the other benchmark algorithms.

Theorems & Definitions (11)

  • Theorem 1
  • Corollary 1.1
  • Theorem 2
  • Corollary 2.1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • ...and 1 more