Improving the Worst-Case Bidirectional Communication Complexity for Nonconvex Distributed Optimization under Function Similarity

Kaja Gruntkowska; Alexander Tyurin; Peter Richtárik

TL;DR

This paper focuses on optimizing server-to-worker communication. It uncovers inefficiencies in prevalent downlink compression approaches and introduces MARINA-P, a novel downlink compression method that employs a collection of correlated compressors.

Abstract

Effective communication between the server and workers plays a key role in distributed optimization. In this paper, we focus on optimizing the server-to-worker communication, uncovering inefficiencies in prevalent downlink compression approaches. Considering first the pure setup where the uplink communication costs are negligible, we introduce MARINA-P, a novel method for downlink compression that employs a collection of correlated compressors. Theoretical analysis demonstrates that MARINA-P with permutation compressors can achieve a server-to-worker communication complexity that improves with the number of workers, making it provably superior to existing algorithms. We further show that MARINA-P can serve as a starting point for extensions such as methods supporting bidirectional compression. We introduce M3, a method that combines MARINA-P with uplink compression and a momentum step, achieving bidirectional compression with provable improvements in total communication complexity as the number of workers increases. The theoretical findings align closely with empirical experiments, underscoring the efficiency of the proposed algorithms.
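The abstract attributes the gains of MARINA-P to correlated permutation compressors. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the commonly used Perm$K$ construction for the case $d \geq n$ with $n \mid d$: a single shared random permutation splits the $d$ coordinates into $n$ disjoint blocks, and worker $i$ receives only its block, scaled by $n$. The function name `perm_k` and its `seed` parameter are hypothetical and chosen for illustration only.

```python
import random


def perm_k(x, n, seed=None):
    """Return n sparse messages built from x with one shared permutation.

    Illustrative sketch: assumes d >= n and n divides d.
    """
    d = len(x)
    if d < n or d % n != 0:
        raise ValueError("this sketch assumes d >= n and n | d")
    q = d // n  # number of coordinates sent to each worker

    # One permutation shared by all workers: this is what correlates them.
    rng = random.Random(seed)
    perm = list(range(d))
    rng.shuffle(perm)

    messages = []
    for i in range(n):
        msg = [0.0] * d
        # Worker i gets the i-th block of the permuted coordinates,
        # scaled by n so that each message is unbiased on its own.
        for j in perm[i * q:(i + 1) * q]:
            msg[j] = n * x[j]
        messages.append(msg)
    return messages


x = [1.0, -2.0, 3.0, 0.5, 4.0, -1.5]
msgs = perm_k(x, n=3, seed=0)

# Averaging the n messages recovers x, since the blocks are disjoint
# and together cover every coordinate exactly once.
avg = [sum(m[j] for m in msgs) / len(msgs) for j in range(len(x))]
```

In this sketch each message carries only $d/n$ nonzero coordinates, yet the average over workers reproduces the original vector. Independent compressors do not have this property, which gives some intuition for why the per-worker downlink cost can shrink as the number of workers grows.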

Paper Structure (42 sections, 50 theorems, 207 equations, 6 figures, 2 tables, 5 algorithms)

Introduction
Related Work
Contributions
Lower Bound under Smoothness
The MARINA-P Method
Three ways to compress
Recap: permutation compressors Perm$K$
Warmup: homogeneous quadratics
Functional $(L_A, L_B)$ Inequality
The Convergence Theory of MARINA-P with Perm$K$
Estimating $L_A$ and $L_B$ in the General Case
M3: A New Bidirectional Method
The Convergence Theory of M3
Experimental Highlights
Three unbiased ways to compress
...and 27 more sections

Key Result

Theorem 3.1

Under Assumptions ass:lipschitz_constant, ass:lower_bound and ass:independent, all methods in which the server communicates with clients using different and independent unbiased compressors from $\mathbb{U}\left(\omega\right)$ and sends one compressed vector to each worker cannot converge before $\O

Figures (6)

Figure 1: Experiments on the quadratic optimization problem from Section \ref{sec:marinap_quadratic}. We plot the norm of the gradient w.r.t. the number of coordinates sent from the server to the workers.
Figure 2: Experiments on the quadratic optimization problem from Section \ref{sec:core_m3}. We plot the norm of the gradient w.r.t. the number of coordinates sent from the server (s-to-w) and from the workers (w-to-s).
Figure 3: Experiments on the autoencoder task from Section \ref{sec:autoencode}. We plot the norm of the gradient w.r.t. the number of coordinates sent from the server (s-to-w) and from the workers (w-to-s).
Figure 4: Experiments on the quadratic optimization problem from Section \ref{sec:exp_quad_lalb} with $n=10$ for $L_A^2 \in \left\{ 0,1,10,100 \right\}$ and $L_B^2 \in \left\{ 100,1000,10000,100000 \right\}$.
Figure 5: Experiments on the quadratic optimization problem from Section \ref{sec:exp_quad_lalb} with $n=100$ for $L_A^2 \in \left\{ 0,1,10,100 \right\}$ and $L_B^2 \in \left\{ 100,1000,10000,100000 \right\}$.
...and 1 more figures

Theorems & Definitions (94)

Definition 1.3
Definition 1.4
Theorem 3.1: Slightly Less Formal Reformulation of Theorem \ref{theorem:lower_bound}
Remark 3.2
Definition 4.1: Perm$K$ (for $d \geq n$ and $n|d$)
Remark 4.3
Theorem 4.4
Theorem 4.5
Theorem 4.6
Corollary 4.7
...and 84 more
