A Computation and Communication Efficient Method for Distributed Nonconvex Problems in the Partial Participation Setting

Alexander Tyurin, Peter Richtárik

TL;DR

The paper tackles distributed nonconvex optimization under partial participation and communication constraints. It introduces DASHA-PP, a variance-reduced method with compressed communication that accounts for partial participation and adapts its update rules to preserve convergence guarantees without assuming bounded inter-client gradient dissimilarity. The theory shows that DASHA-PP achieves optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting, with variants tailored to the finite-sum and stochastic regimes and support for Rand$K$-type compressors. Experiments corroborate the theoretical findings and illustrate the practical benefit of combining variance reduction, compression, and partial participation in distributed learning systems.
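
To make the Rand$K$-type compressors mentioned above concrete, here is a minimal sketch in plain Python. It assumes the standard definition of the Rand$K$ sparsifier (keep $K$ coordinates chosen uniformly at random and rescale them by $d/K$ so the output is unbiased); the function names `rand_k` and `rand_k_omega` are illustrative and do not come from the paper.

```python
import random


def rand_k(x, k, rng=random):
    """Rand-K sparsifier (assumed standard form): keep k coordinates chosen
    uniformly at random without replacement and scale them by d/k, so that
    the expectation of the output equals x."""
    d = len(x)
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= len(x)")
    kept = set(rng.sample(range(d), k))  # indices that survive compression
    scale = d / k                        # rescaling that makes the map unbiased
    return [scale * v if i in kept else 0.0 for i, v in enumerate(x)]


def rand_k_omega(d, k):
    """Variance parameter omega = d/k - 1 of Rand-K, i.e. the smallest omega
    with E||C(x) - x||^2 <= omega * ||x||^2 for all x."""
    return d / k - 1.0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    print(rand_k([1.0, 2.0, 3.0, 4.0], 2, rng))
    print(rand_k_omega(4, 2))
```

Only $K$ of the $d$ coordinates need to be transmitted per message, at the price of a variance factor $\omega = d/K - 1$ that enters the step-size parameters of the method.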

Abstract

We present a new method that includes three key components of distributed optimization and federated learning: variance reduction of stochastic gradients, partial participation, and compressed communication. We prove that the new method has optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting. Regardless of the communication compression feature, our method successfully combines variance reduction and partial participation: we get the optimal oracle complexity, never need the participation of all nodes, and do not require the bounded gradients (dissimilarity) assumption.

Paper Structure

This paper contains 37 sections, 45 theorems, 307 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2

Suppose that Assumptions ass:lower_bound, ass:lipschitz_constant, ass:nodes_lipschitz_constant, ass:compressors and ass:partial_participation hold. Let us take $a = \frac{p_{\textnormal{a}}}{2 \omega + 1}$, $b = \frac{p_{\textnormal{a}}}{2 - p_{\textnormal{a}}}$, and $g^{0}_i = h^{0}_i = \nabla f_i(x^0)$ for all $i \in [n]$ in Algorithm alg:main_algorithm (DASHA-PP), then ${\rm E}\left[\left\| \na
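
The parameter choices $a$ and $b$ stated in the theorem can be evaluated directly. The sketch below assumes that $p_{\textnormal{a}} \in (0, 1]$ is the participation probability from the partial participation assumption and that $\omega \geq 0$ is the variance parameter of the compressor; the function name `theorem2_parameters` is hypothetical and only wraps the two formulas above.

```python
def theorem2_parameters(p_a, omega):
    """Evaluate a = p_a / (2*omega + 1) and b = p_a / (2 - p_a).

    Assumptions (not spelled out in the excerpt): p_a is a participation
    probability in (0, 1] and omega >= 0 is the compressor variance parameter.
    """
    if not 0.0 < p_a <= 1.0:
        raise ValueError("p_a is assumed to lie in (0, 1]")
    if omega < 0.0:
        raise ValueError("omega is assumed to be nonnegative")
    a = p_a / (2.0 * omega + 1.0)
    b = p_a / (2.0 - p_a)
    return a, b


if __name__ == "__main__":
    # Full participation with an identity compressor (omega = 0) gives a = b = 1.
    print(theorem2_parameters(1.0, 0.0))
    # Half of the nodes participate on average; compressor with omega = 1.
    print(theorem2_parameters(0.5, 1.0))
```

Both parameters shrink as $p_{\textnormal{a}}$ decreases, and $a$ additionally shrinks as the compressor becomes more aggressive (larger $\omega$).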

Figures (5)

  • Figure 1: Classification task with the real-sim dataset.
  • Figure 2: Classification task on real-sim
  • Figure 3: Classification task on MNIST
  • Figure 4: Classification task on real-sim
  • Figure 5: Classification task on MNIST

Theorems & Definitions (76)

  • Definition 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • Corollary 2
  • Theorem 4
  • Corollary 3
  • Corollary 4
  • Lemma 1
  • proof
  • ...and 66 more