Accelerated Methods with Compressed Communications for Distributed Optimization Problems under Data Similarity

Dmitry Bylinkin, Aleksandr Beznosikov

TL;DR

This work addresses the communication bottleneck in distributed optimization by combining compression, local steps, and data similarity under a star topology. It introduces OLGA (unbiased compression) and EF-OLGA (biased compression with error feedback), which leverage variance reduction and Hessian similarity to achieve favorable communication complexities. The theoretical results give guarantees under the CC-1, CC-2, and CC-3 communication criteria, with an optimal compression parameter of roughly $\gamma_{\omega}=\Theta(\sqrt{M})$ that balances efficiency and accuracy, together with corollaries for accelerated performance under similarity. Experiments on ridge and logistic regression tasks with LibSVM datasets confirm the proposed methods’ advantages and robustness across varying numbers of workers and compressors.

Abstract

In recent years, as data and problem sizes have increased, distributed learning has become an essential tool for training high-performance models. However, the communication bottleneck, especially for high-dimensional data, remains a challenge. Several techniques have been developed to overcome it, including communication compression and local steps, which work particularly well when the local data samples are similar. In this paper, we study the synergy of these approaches for efficient distributed optimization. We propose the first theoretically grounded accelerated algorithms that use unbiased and biased compression under data similarity, building on the variance reduction and error feedback frameworks. Our results set a new record and are confirmed by experiments on different average losses and datasets.
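To make the distinction between unbiased and biased compression concrete, here is a minimal sketch in plain Python. It is not the authors' OLGA or EF-OLGA. The abstract does not say which compressors are used, so Rand-k (unbiased) and Top-k (biased) are assumed here as standard examples, and the helper names `rand_k`, `top_k`, and `ef_compress` are hypothetical. The error-feedback step only shows the generic idea of carrying the compression residual into the next message.

```python
import random

def rand_k(x, k, rng):
    """Unbiased Rand-k (assumed example): keep k uniformly chosen
    coordinates and rescale them by d/k so that E[rand_k(x)] = x."""
    d = len(x)
    keep = set(rng.sample(range(d), k))
    return [xi * d / k if i in keep else 0.0 for i, xi in enumerate(x)]

def top_k(x, k):
    """Biased Top-k (assumed example): keep the k largest-magnitude
    coordinates unchanged and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]

def ef_compress(g, err, k):
    """Generic error-feedback step: compress the vector plus the carried
    error, then store what the compressor dropped for the next round."""
    corrected = [gi + ei for gi, ei in zip(g, err)]
    msg = top_k(corrected, k)
    new_err = [ci - mi for ci, mi in zip(corrected, msg)]
    return msg, new_err

if __name__ == "__main__":
    g = [1.0, -5.0, 3.0, 0.5]
    print(rand_k(g, 2, random.Random(0)))   # two coordinates survive, doubled
    msg, err = ef_compress(g, [0.0] * 4, 2)
    print(msg, err)
```

In this sketch Top-k alone would never transmit the small coordinates of `g`. With error feedback they accumulate in `err` and are sent once they become large enough, which is the usual motivation for pairing biased compressors with error feedback.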

Paper Structure

This paper contains 23 sections, 14 theorems, 120 equations, 36 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Consider an epoch of the unbiased-compression algorithm (alg:unbiased_compr_epoch). Let $h(x)=f_1(x)-f(x)+\frac{1}{2\theta}\|x\|^2$, where $\theta\leq\min\left\{\frac{\sqrt{p}\sqrt{M}}{8\delta\sqrt{\omega}}, \frac{1}{2\delta}\right\}$. Then the following inequality holds for every $x\in\mathbb{R}^d$:
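The constraint on $\theta$ in Lemma 1 can be evaluated numerically. The sketch below computes the right-hand side $\min\left\{\frac{\sqrt{p}\sqrt{M}}{8\delta\sqrt{\omega}}, \frac{1}{2\delta}\right\}$. The function name `theta_upper_bound` and the sample values are hypothetical. This summary does not define $p$, $M$, $\delta$, or $\omega$ at this point, so the code treats them as positive numbers only.

```python
import math

def theta_upper_bound(p, M, delta, omega):
    """Largest theta allowed by the condition in Lemma 1:
    min( sqrt(p) * sqrt(M) / (8 * delta * sqrt(omega)), 1 / (2 * delta) )."""
    if min(p, M, delta, omega) <= 0:
        raise ValueError("all parameters are assumed to be positive")
    compression_term = math.sqrt(p) * math.sqrt(M) / (8 * delta * math.sqrt(omega))
    similarity_term = 1 / (2 * delta)
    return min(compression_term, similarity_term)

if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    print(theta_upper_bound(p=0.25, M=100, delta=1.0, omega=4.0))  # 0.3125
    print(theta_upper_bound(p=1.0, M=64, delta=0.5, omega=1.0))    # 1.0
```

In the first call the compression term (0.5 · 10 / 16 = 0.3125) is the smaller one. In the second the term $\frac{1}{2\delta}=1$ is smaller, so increasing $M$ further would not relax the bound.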

Figures (36)

  • Figure 1: Comparison of state-of-the-art distributed methods on the ridge regression problem (eq:quadr) with $M=100$ on the a9a dataset. The criterion is the communication time (CC-3). For methods with compression, we vary the power of compression $\omega$.
  • Figure 2: Comparison of state-of-the-art distributed methods on the ridge regression problem (eq:quadr) with $M=50$ on the a9a dataset. The criterion is the communication time (CC-3). For methods with compression, we vary the power of compression $\omega$.
  • Figure 3: Comparison of state-of-the-art distributed methods on the ridge regression problem (eq:quadr) with $M=20$ on the a9a dataset. The criterion is the communication time (CC-3). For methods with compression, we vary the power of compression $\omega$.
  • Figure 4: Comparison of state-of-the-art distributed methods on the logistic regression problem (eq:logloss) with $M=70$ on the a9a dataset. The criterion is the communication time (CC-3). For methods with compression, we vary the power of compression $\omega$.
  • Figure 5: Comparison of state-of-the-art distributed methods on the logistic regression problem (eq:logloss) with $M=50$ on the a9a dataset. The criterion is the communication time (CC-3). For methods with compression, we vary the power of compression $\omega$.
  • ...and 31 more figures

Theorems & Definitions (23)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Definition 4
  • Lemma 2
  • ...and 13 more