Accelerated Distributed Optimization with Compression and Error Feedback

Yuan Gao, Anton Rodomanov, Jeremy Rack, Sebastian U. Stich

TL;DR

This work tackles the communication bottleneck in distributed stochastic optimization by integrating Nesterov acceleration with contractive compression and error feedback, addressing a gap in the theory of accelerated methods under contractive compression in the general convex regime. The authors introduce ADEF, a novel algorithm that uses gradient-difference compression and enhanced error feedback to compensate for compression errors, and they provide a general descent framework for accelerated methods with inexact updates. The main theoretical result is the first accelerated convergence rate for stochastic distributed optimization with contractive compression in the general convex setting, with an iteration complexity of $T = O\left(\frac{R_0^2\sigma^2}{n\varepsilon^2} + \frac{\sqrt{L}\,R_0^2\sigma}{\delta^2\varepsilon^{3/2}} + \frac{\sqrt{\ell R_0^2}}{\delta^2\sqrt{\varepsilon}}\right)$ to achieve $F_T \le \varepsilon$; in the deterministic case ($\sigma^2=0$) it attains an accelerated $O(1/\sqrt{\varepsilon})$ complexity with a $1/\delta^2$ dependence on the compression parameter $\delta$. Empirical results on synthetic and MNIST tasks corroborate the theory, showing reduced communication and competitive convergence relative to existing methods. The work thus advances scalable, communication-efficient training of large models by combining compression with acceleration in a principled framework.
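
As a quick check of the deterministic claim, setting $\sigma = 0$ in the bound quoted in this summary (symbols copied as they appear there) makes the first two terms vanish, leaving only the third:

$$T = O\left(\frac{\sqrt{\ell R_0^2}}{\delta^2\sqrt{\varepsilon}}\right).$$

This is the $1/\sqrt{\varepsilon}$ scaling with the $1/\delta^2$ factor mentioned in the summary; it is a restatement of the quoted bound, not an additional result.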

Abstract

Modern machine learning tasks often involve massive datasets and models, necessitating distributed optimization algorithms with reduced communication overhead. Communication compression, where clients transmit compressed updates to a central server, has emerged as a key technique to mitigate communication bottlenecks. However, the theoretical understanding of stochastic distributed optimization with contractive compression remains limited, particularly in conjunction with Nesterov acceleration -- a cornerstone for achieving faster convergence in optimization. In this paper, we propose a novel algorithm, ADEF (Accelerated Distributed Error Feedback), which integrates Nesterov acceleration, contractive compression, error feedback, and gradient difference compression. We prove that ADEF achieves the first accelerated convergence rate for stochastic distributed optimization with contractive compression in the general convex regime. Numerical experiments validate our theoretical findings and demonstrate the practical efficacy of ADEF in reducing communication costs while maintaining fast convergence.
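
To make two of the ingredients named in the abstract concrete, here is a minimal sketch of a contractive compressor and of one round of classic error feedback. It is not ADEF: there is no Nesterov acceleration and no gradient-difference compression. Top-$K$ is used because the experiments mention it; for Top-$K$ in dimension $d$ the contraction parameter is $\delta = K/d$. The function names `top_k`, `ef_step`, and `run_demo`, the quadratic test objective, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]  # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out


def ef_step(x, errors, grads, k, lr):
    """One round of classic error feedback (illustrative, not ADEF).

    Each client adds its stored error to the scaled gradient, sends the
    Top-K part, and keeps whatever was dropped for the next round.
    """
    new_errors, msgs = [], []
    for e, g in zip(errors, grads):
        p = e + lr * g           # re-inject what was dropped earlier
        m = top_k(p, k)          # compressed message sent to the server
        new_errors.append(p - m) # residual stays on the client
        msgs.append(m)
    x_new = x - np.mean(msgs, axis=0)  # server averages the messages
    return x_new, new_errors


def run_demo(n=4, d=20, k=2, lr=0.1, steps=500, seed=0):
    """Hypothetical toy problem: f_i(x) = 0.5 * ||x - b_i||^2 on n clients."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=(n, d))
    x = np.zeros(d)
    errors = [np.zeros(d) for _ in range(n)]
    x_star = b.mean(axis=0)  # minimizer of the average objective
    d0 = np.linalg.norm(x - x_star)
    for _ in range(steps):
        grads = [x - b[i] for i in range(n)]  # exact local gradients
        x, errors = ef_step(x, errors, grads, k, lr)
    return d0, np.linalg.norm(x - x_star)
```

Calling `run_demo()` returns the initial and final distance to the minimizer of the toy problem. With `d=20` and `k=2` the compressor has $\delta = 0.1$, the value quoted for the MNIST experiment; the other defaults are arbitrary.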

Paper Structure

This paper contains 19 sections, 16 theorems, 88 equations, 6 figures, 1 table, 6 algorithms.

Key Result

Theorem 4.2

Under the convexity, smoothness, and bounded-variance assumptions (assumption:convexity, assumption:smoothness, assumption:bounded_variance_general), for all $T\geq 1$ and for $(\mathbf{x}_t,\mathbf{y}_t,\mathbf{v}_t)_{t=0}^\infty$ generated by the algorithm labeled alg:acc-inexact, it holds that: where we write $w_t\coloneqq \min\{2,a_tL\}+\frac{4La_t^2}{A_t}+\frac{4La_{t+1}^2}{A_{t+1}}$.

Figures (6)

  • Figure 1: Competitive performance. Comparison of the performance of ADEF, EControl, EF and NEOLITHIC on the MNIST classification problem. We use Top-$K$ compression with $\delta=0.1$. We see that ADEF performs competitively in both the loss and the accuracy, while NEOLITHIC performs the worst.
  • Figure 2: Achieving acceleration. Performance of ADEF and EF with $\sigma^2=0$. We set a target error of $0.01$. We see that ADEF achieves the accelerated $\mathcal{O}(1/T^2)$ rate.
  • Figure 3: Achieving linear speedup. The performance of ADEF with an increasing number of clients $n$ for the synthetic logistic regression problem. We fix $\gamma$ to be $0.0001$. The error that the algorithm stabilizes around decreases as $n$ increases.
  • Figure : Repeated Compressor $\mathcal{C}_R$
  • Figure : Repeated Compressor $\mathcal{C}_R$
  • ...and 1 more figure

Theorems & Definitions (33)

  • Definition 2.1
  • Theorem 4.2
  • Lemma 5.0
  • Lemma 5.0
  • Theorem 5.1
  • Remark 5.2
  • Remark 5.3
  • Lemma A.1
  • proof
  • Theorem B.1
  • ...and 23 more