Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness

Arman Bolatov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

Abstract

We consider distributed optimization under Byzantine attacks in the presence of $(L_0,L_1)$-smoothness, a generalization of standard $L$-smoothness that captures functions with state-dependent gradient Lipschitz constants. We propose Byz-NSGDM, a normalized stochastic gradient descent method with momentum that achieves robustness against Byzantine workers while maintaining convergence guarantees. Our algorithm combines momentum normalization with Byzantine-robust aggregation enhanced by Nearest Neighbor Mixing (NNM) to handle both the challenges posed by $(L_0,L_1)$-smoothness and Byzantine adversaries. We prove that Byz-NSGDM achieves a convergence rate of $O(K^{-1/4})$ up to a Byzantine bias floor proportional to the robustness coefficient and gradient heterogeneity. Experimental validation on heterogeneous MNIST classification, synthetic $(L_0,L_1)$-smooth optimization, and character-level language modeling with a small GPT model demonstrates the effectiveness of our approach against various Byzantine attack strategies. An ablation study further shows that Byz-NSGDM is robust across a wide range of momentum and learning rate choices.
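The abstract describes Byz-NSGDM only in words, so the following Python sketch illustrates what one server round could look like. It is an illustration under stated assumptions, not the authors' Algorithm, which is not reproduced in this excerpt. The momentum convention `v = beta * v + (1 - beta) * g`, the use of coordinate-wise median (CM, one of the aggregators named in the figures) as the robust aggregator, the small `eps` in the normalization, and all function and parameter names are hypothetical choices made for this example.

```python
import numpy as np


def nnm(vectors, num_byzantine):
    """Nearest Neighbor Mixing (sketch): replace each vector by the mean of
    its n - f nearest neighbors (itself included), where f = num_byzantine."""
    n = len(vectors)
    k = n - num_byzantine
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Indices of the k closest vectors for every row, shape (n, k).
    nearest = np.argsort(dists, axis=1)[:, :k]
    return vectors[nearest].mean(axis=1)


def coordinate_median(vectors):
    """Coordinate-wise median, used here as an example robust aggregator."""
    return np.median(vectors, axis=0)


def byz_nsgdm_step(x, momenta, grads, beta, lr, num_byzantine, eps=1e-12):
    """One assumed round: per-worker momentum, NNM + robust aggregation,
    then a normalized step of length (approximately) lr."""
    # Assumed momentum convention; the paper's may differ.
    momenta = beta * momenta + (1.0 - beta) * grads
    # Byzantine rows may contain arbitrary vectors; NNM + CM limit their effect.
    agg = coordinate_median(nnm(momenta, num_byzantine))
    # Normalization makes the update depend only on the direction of agg.
    x_new = x - lr * agg / (np.linalg.norm(agg) + eps)
    return x_new, momenta
```

In this sketch, `grads` is an array with one row per worker, and a Byzantine worker is modeled simply as a row holding an arbitrary vector. Because the step is normalized, its length stays close to `lr` regardless of how large the aggregated momentum is, which is the property that makes normalized methods a natural fit when the local smoothness constant grows with the gradient norm.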

Paper Structure

This paper contains 33 sections, 17 theorems, 107 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Let Assumptions as:lower_bound--as:heterogeneity_detailed hold and let the server use a $(\delta,\kappa)$-robust aggregator. Run Algorithm alg:Byz-NSGDM with [...] where [...]. Define $\Delta^* := \tfrac{1}{G}\sum_{i\in\mathcal{G}}(f^*-f_i^*)$ and $V_0 := \tfrac{1}{G}\sum_{i\in\mathcal{G}}\mathbb{E}[\|v_i^0-\nabla f_i(x^0)\|]$. Then [...]

Figures (4)

  • Figure 1: Ablation study on heterogeneous MNIST under Bit Flipping attack with RFA aggregation. (a) Final test accuracy heatmap across momentum ($\beta$) and learning rate ($\gamma_0$). (b) Training curves for each momentum value using its best learning rate. (c) Final accuracy vs momentum for different learning rates.
  • Figure 2: Test accuracy curves on heterogeneous MNIST under Byzantine attacks. Each row corresponds to a different attack (BF, LF, Mimic), and each column to a different aggregator (RFA, Krum, CM). Lines show mean accuracy over 3 seeds with shaded regions indicating $\pm 1$ standard deviation.
  • Figure 3: Gradient norm evolution (log scale) for synthetic $(L_0,L_1)$-smooth optimization under Byzantine attacks. Each row shows a different attack (BF, Mimic, ALIE), and each column a different aggregator (RFA, Krum, CM). Lines show mean gradient norm over 3 seeds with shaded regions indicating $\pm 1$ standard deviation. Legend includes tuned learning rates.
  • Figure 4: Validation perplexity (log scale) on Shakespeare character-level language modeling under Byzantine attacks. Each row shows a different attack (BF, Mimic, ALIE), and each column a different aggregator (RFA, Krum, CM).

Theorems & Definitions (39)

  • Definition 1: Byzantine-robust aggregator
  • Remark 1
  • Theorem 1: Convergence of Byz-NSGDM
  • Remark 2: Fixed-horizon schedule
  • Corollary 1
  • Remark 3
  • Remark 4: Theory vs. practice step-schedules
  • Lemma 1: Lemma 1 from khiriat2024
  • Lemma 2: Lemma 2 from khiriat2024
  • Lemma 3
  • ...and 29 more