Communication Compression for Byzantine Robust Learning: New Efficient Algorithms and Improved Rates

Ahmad Rammal, Kaja Gruntkowska, Nikita Fedin, Eduard Gorbunov, Peter Richtárik

TL;DR

A new Byzantine-robust method with compression, Byz-DASHA-PAGE, is proposed. It is proved to have a better convergence rate, to have a smaller neighborhood size in the heterogeneous case, and to tolerate more Byzantine workers under over-parametrization than the previous method with SOTA theoretical convergence guarantees.

Abstract

Byzantine robustness is an essential feature of algorithms for certain distributed optimization problems, typically encountered in collaborative/federated learning. These problems are usually huge-scale, which makes communication compression imperative as well. These factors have spurred recent algorithmic and theoretical developments in the literature on Byzantine-robust learning with compression. In this paper, we contribute to this research area in two main directions. First, we propose a new Byzantine-robust method with compression, Byz-DASHA-PAGE, and prove that it has a better convergence rate (for non-convex and Polyak-Łojasiewicz smooth optimization problems), has a smaller neighborhood size in the heterogeneous case, and tolerates more Byzantine workers under over-parametrization than the previous method with SOTA theoretical convergence guarantees (Byz-VR-MARINA). Second, we develop the first Byzantine-robust method with communication compression and error feedback, Byz-EF21, along with its bidirectional compression version, Byz-EF21-BC, and derive convergence rates for these methods in the non-convex and Polyak-Łojasiewicz smooth cases. We test the proposed methods and illustrate our theoretical findings in numerical experiments.
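
To make the combination of error feedback, compression, and robust aggregation more concrete, the following is a minimal toy sketch and not the paper's algorithm. It assumes the standard EF21 recursion $g_i^{t+1} = g_i^t + \mathcal{C}(\nabla f_i(x^{t+1}) - g_i^t)$ with a Top-K compressor, uses the coordinate-wise median as a stand-in for a robust aggregator, and runs on a hypothetical homogeneous quadratic $f_i(x) = \tfrac{1}{2}\|x - b\|^2$. The function names, problem sizes, and step size are illustrative assumptions.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def run_toy(steps=500, gamma=0.05, k=2, seed=0):
    """Toy EF21-style loop with a median aggregator (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    d, n_good, n_byz = 6, 4, 1
    b = rng.normal(size=d)               # shared minimizer of every good f_i
    x = np.zeros(d)
    grad = lambda z: z - b               # gradient of 0.5 * ||z - b||^2
    # Each good worker starts from its exact gradient: g_i^0 = grad f_i(x^0).
    g = np.tile(grad(x), (n_good, 1))
    for _ in range(steps):
        # The Byzantine worker sends arbitrary large vectors.
        byz = 100.0 * rng.normal(size=(n_byz, d))
        # Coordinate-wise median as a simple stand-in robust aggregator.
        agg = np.median(np.vstack([g, byz]), axis=0)
        x = x - gamma * agg
        # EF21 recursion: only the compressed difference is communicated.
        for i in range(n_good):
            g[i] = g[i] + top_k(grad(x) - g[i], k)
    return x, b

x_final, b_true = run_toy()
print(np.linalg.norm(x_final - b_true))
```

In this homogeneous toy setting the four good workers hold identical estimates, so the median discards the single Byzantine vector exactly. The heterogeneous case analyzed in the paper is harder and is not captured by this sketch.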

Paper Structure

This paper contains 51 sections, 31 theorems, 128 equations, 10 figures, 3 tables, 5 algorithms.

Key Result

Theorem 2.1

Let Assumptions as:smoothness, as:hessian_variance, as:hessian_variance_local and as:bounded_heterogeneity hold. Assume that $0 < \gamma \leq (L + \sqrt{\eta})^{-1}$, $\delta < \left( (8c + 4\sqrt{c})B \right)^{-1}$, and initialize $g_i^0 = \nabla f_i(x^0)$ for all $i \in \mathcal{G}$, where $\eta = \ldots$ [...] where $\delta^0 = f(x^0) - f^*$, $A = 1 - \left( 8c\delta + \sqrt{8c\delta/G} \right)B$ and $\widehat{x}$ [...]

Figures (10)

  • Figure 1: Convergence in terms of the number of iterations in the homogeneous non-convex setting.
  • Figure 2: Convergence in terms of the number of bits sent in the heterogeneous non-convex setting.
  • Figure 3: Communication complexity comparison in the heterogeneous non-convex setting on the w8a dataset.
  • Figure 4: Communication complexity comparison in the heterogeneous strongly convex setting on the phishing dataset.
  • Figure 5: Communication complexity comparison in the heterogeneous strongly convex setting on the w8a dataset.
  • ...and 5 more figures

Theorems & Definitions (57)

  • Definition 1.1: $(\delta, c)$-Robust Aggregator
  • Definition 1.2: Unbiased compressor
  • Definition 1.3: Contractive compressor
  • Theorem 2.1
  • Theorem 2.2
  • Theorem 3.1
  • Lemma C.1: Lemma 2 of Li et al., 2021 (PAGE)
  • Lemma C.2: Lemma 5 of Richtárik et al., 2021 (EF21)
  • Lemma E.1: Bound on the variance
  • proof
  • ...and 47 more
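
Definitions 1.2 and 1.3 are listed above by name only. As a hedged illustration, the sketch below assumes the standard forms of these notions: an unbiased compressor $\mathcal{Q}$ satisfies $\mathbb{E}[\mathcal{Q}(x)] = x$ and $\mathbb{E}\|\mathcal{Q}(x) - x\|^2 \leq \omega \|x\|^2$, and a contractive compressor $\mathcal{C}$ satisfies $\mathbb{E}\|\mathcal{C}(x) - x\|^2 \leq (1 - \alpha)\|x\|^2$. Rand-K and Top-K are common textbook examples chosen here for illustration; they are not taken from the listing above.

```python
import numpy as np

def rand_k(x, k, rng):
    """Rand-K: keep k uniformly random coordinates, scaled by d/k.

    Under the assumed definition this is unbiased with omega = d/k - 1.
    """
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

def top_k(x, k):
    """Top-K: keep the k largest-magnitude coordinates.

    Under the assumed definition this is contractive with alpha = k/d.
    """
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0, 0.5, -1.0])
d, k = x.size, 2

# Empirical check of unbiasedness: the sample mean should be close to x.
mean_q = np.mean([rand_k(x, k, rng) for _ in range(20000)], axis=0)

# Deterministic check of the contraction inequality for Top-K.
err = np.sum((top_k(x, k) - x) ** 2)
bound = (1 - k / d) * np.sum(x ** 2)
print(mean_q, err <= bound)
```

Top-K is biased, which is why it is usually paired with an error-feedback mechanism, whereas Rand-K is unbiased at the cost of a larger variance.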