Optimal Complexity in Byzantine-Robust Distributed Stochastic Optimization with Data Heterogeneity

Qiankun Shi, Jie Peng, Kun Yuan, Xiao Wang, Qing Ling

TL;DR

This work establishes tight fundamental limits for Byzantine-robust distributed stochastic optimization under data heterogeneity, showing the convergence error splits into a non-vanishing Byzantine term and a vanishing optimization term. It derives lower bounds on both Byzantine error and oracle-query complexity, and then designs Byrd-Nesterov-based algorithms with variance reduction that attain these bounds up to logarithmic factors, proving tightness. The results quantify how heterogeneity, Byzantine fraction, aggregator robustness, and stochastic noise interact to set unavoidable performance barriers, while the proposed methods demonstrate optimal robustness and convergence rates in both strongly convex and non-convex regimes. Numerical experiments on logistic regression and CNN training validate the theoretical findings and illustrate practical robustness against a wide range of Byzantine attacks.

Abstract

In this paper, we establish tight lower bounds for Byzantine-robust distributed first-order stochastic optimization methods in both strongly convex and non-convex stochastic optimization. We reveal that when the distributed nodes have heterogeneous data, the convergence error comprises two components: a non-vanishing Byzantine error and a vanishing optimization error. We establish the lower bounds on the Byzantine error and on the minimum number of queries to a stochastic gradient oracle required to achieve an arbitrarily small optimization error. Nevertheless, we identify significant discrepancies between our established lower bounds and the existing upper bounds. To fill this gap, we leverage the techniques of Nesterov's acceleration and variance reduction to develop novel Byzantine-robust distributed stochastic optimization methods that provably match these lower bounds, up to logarithmic factors, implying that our established lower bounds are tight.
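The split into a non-vanishing Byzantine error and a vanishing optimization error can be seen in a toy setting. The sketch below is not the paper's algorithm and does not reproduce its bounds. It assumes scalar quadratic losses $f_i(x) = \tfrac{1}{2}(x - a_i)^2$, noiseless gradients, and a plain median aggregator, and the names `run_robust_gd`, `honest_targets`, and `byzantine_target` are hypothetical. With heterogeneous honest targets the iterate settles at a point away from the honest optimum no matter how many steps are taken, while with identical honest targets the same attack leaves no residual error.

```python
import statistics


def run_robust_gd(honest_targets, byzantine_target, num_byzantine, steps, lr=0.5):
    """Median-aggregated gradient descent on f_i(x) = 0.5 * (x - a_i)^2.

    Returns the distance between the final iterate and the minimizer of the
    average honest loss (the mean of the honest targets).
    """
    x = 0.0
    for _ in range(steps):
        # Honest nodes report exact gradients of their local quadratic.
        honest_grads = [x - a for a in honest_targets]
        # Byzantine nodes (toy attack) all pretend their target is far away.
        byzantine_grads = [x - byzantine_target] * num_byzantine
        # The server aggregates with a median instead of a mean.
        x -= lr * statistics.median(honest_grads + byzantine_grads)
    honest_optimum = sum(honest_targets) / len(honest_targets)
    return abs(x - honest_optimum)


heterogeneous = [0.0, 1.0, 2.0, 3.0, 10.0]  # honest optimum is 3.2
homogeneous = [3.2] * 5                     # same optimum, no heterogeneity

err_het_short = run_robust_gd(heterogeneous, -100.0, 2, steps=100)
err_het_long = run_robust_gd(heterogeneous, -100.0, 2, steps=10_000)
err_hom = run_robust_gd(homogeneous, -100.0, 2, steps=100)

print(err_het_short, err_het_long, err_hom)
```

In this toy run the heterogeneous error is the same after 100 and after 10,000 steps, which is the behavior the abstract describes as a Byzantine error that more iterations cannot remove.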

Paper Structure

This paper contains 37 sections, 19 theorems, 150 equations, 4 figures, 3 tables, 4 algorithms.

Key Result

Lemma 4

Given $\zeta^2 > 0$ and $\delta \in [0,\delta_{\rm max}]$, there exists a distributed problem in the form of prob-general having at least $(1-\delta)n$ honest nodes with function $f \in \mathcal{F}$, and a $(\delta_{\rm max},\rho)$-robust aggregator $\mathsf A \in \mathcal{A}$, such that for any me where $\tilde{x}$ is the output of $\mathsf M$, independent of the number of iterations and the nu

Figures (4)

  • Figure 1: Worst-case maximum top-1 accuracy of DSGD, DSGDm and Algorithm 1.
  • Figure 2: Top-$1$ test accuracies of DSGD, DSGDm and Algorithm 1 with Med, CC, GM and TM for logistic regression, under BF, LF, IPM and ALIE attacks.
  • Figure 3: Top-$1$ test accuracies of DSGD, DSGDm and Algorithm 1 with Med, CC, GM and TM for convolutional neural network training, under BF, LF, IPM and ALIE attacks.
  • Figure 4: Evolution of the iterates.
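
The figure captions refer to aggregators by abbreviation. As a minimal sketch, assuming Med denotes the coordinate-wise median and TM the coordinate-wise trimmed mean (the captions do not spell these out, and CC and GM are not covered here), the two rules could look as follows. The function names and the `num_trim` parameter are hypothetical.

```python
import numpy as np


def coordinate_median(messages):
    """Coordinate-wise median of an (n_nodes, dim) array of messages."""
    return np.median(messages, axis=0)


def trimmed_mean(messages, num_trim):
    """Coordinate-wise trimmed mean.

    In every coordinate, drop the num_trim smallest and num_trim largest
    values, then average the rest.
    """
    n_nodes = messages.shape[0]
    if 2 * num_trim >= n_nodes:
        raise ValueError("num_trim must satisfy 2 * num_trim < n_nodes")
    sorted_msgs = np.sort(messages, axis=0)  # sorts each coordinate separately
    kept = sorted_msgs[num_trim:n_nodes - num_trim]
    return kept.mean(axis=0)


# Four honest messages close to (1, 2) and one Byzantine outlier.
honest = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]])
byzantine = np.array([[1000.0, -1000.0]])
messages = np.vstack([honest, byzantine])

plain_mean = messages.mean(axis=0)
med = coordinate_median(messages)
tm = trimmed_mean(messages, num_trim=1)

print("mean:", plain_mean)
print("median:", med)
print("trimmed mean:", tm)
```

A single outlier drags the plain mean far from the honest average, whereas both robust rules stay close to it. The residual gap between a robust aggregate and the honest average is the kind of quantity that an aggregator robustness constant such as $\rho$ is meant to control.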

Theorems & Definitions (22)

  • Definition 1
  • Definition 2: $(\delta_{\rm max},\rho)$-robust aggregator
  • Remark 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Lemma 9
  • Lemma 10
  • ...and 12 more