Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness

Arman Bolatov; Samuel Horváth; Martin Takáč; Eduard Gorbunov

Abstract

We consider distributed optimization under Byzantine attacks in the presence of $(L_0,L_1)$-smoothness, a generalization of standard $L$-smoothness that captures functions with state-dependent gradient Lipschitz constants. We propose Byz-NSGDM, a normalized stochastic gradient descent method with momentum that achieves robustness against Byzantine workers while maintaining convergence guarantees. Our algorithm combines momentum normalization with Byzantine-robust aggregation enhanced by Nearest Neighbor Mixing (NNM) to handle the challenges posed by both $(L_0,L_1)$-smoothness and Byzantine adversaries. We prove that Byz-NSGDM achieves a convergence rate of $O(K^{-1/4})$ up to a Byzantine bias floor proportional to the robustness coefficient and gradient heterogeneity. Experimental validation on heterogeneous MNIST classification, synthetic $(L_0,L_1)$-smooth optimization, and character-level language modeling with a small GPT model demonstrates the effectiveness of our approach against various Byzantine attack strategies. An ablation study further shows that Byz-NSGDM is robust across a wide range of momentum and learning rate choices.
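To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of one possible server-side step: worker momenta are mixed with NNM, aggregated robustly, and the result is normalized before the update. It is an illustration under stated assumptions, not the paper's Algorithm 1. The function names, the momentum convention `beta * v + (1 - beta) * g`, the choice of coordinate-wise median as the aggregator, the quadratic toy objective, and the sign-flipping attack scaled by 10 are all hypothetical choices made for this example.

```python
import math
import statistics


def nnm(vectors, f):
    """Nearest Neighbor Mixing: replace each vector by the average of its
    n - f nearest neighbors (itself included), where f is the assumed
    number of Byzantine workers."""
    k = len(vectors) - f
    mixed = []
    for v in vectors:
        by_dist = sorted(
            vectors, key=lambda u: sum((a - b) ** 2 for a, b in zip(u, v))
        )
        mixed.append([sum(col) / k for col in zip(*by_dist[:k])])
    return mixed


def coordinate_median(vectors):
    """Coordinate-wise median (CM), used here as the robust aggregator."""
    return [statistics.median(col) for col in zip(*vectors)]


def byz_nsgdm_step(x, messages, gamma, f):
    """One normalized step: x <- x - gamma * agg / ||agg||."""
    agg = coordinate_median(nnm(messages, f))
    # Guard against division by zero when the aggregate vanishes.
    scale = max(math.sqrt(sum(a * a for a in agg)), 1e-12)
    return [xi - gamma * a / scale for xi, a in zip(x, agg)]


def demo(steps=200, gamma=0.05, beta=0.5, n_good=5, f=2):
    """Toy run: honest workers share f_i(x) = 0.5 * ||x - target||^2 and
    send exact gradients through momentum; Byzantine workers send a
    sign-flipped, scaled copy of an honest momentum (an assumption)."""
    target = [1.0, -2.0]
    x = [0.0, 0.0]

    def grad(point):
        return [p - t for p, t in zip(point, target)]

    momenta = [grad(x) for _ in range(n_good)]
    for _ in range(steps):
        g = grad(x)
        # Assumed momentum convention: v <- beta * v + (1 - beta) * g.
        momenta = [
            [beta * vj + (1 - beta) * gj for vj, gj in zip(v, g)]
            for v in momenta
        ]
        byz = [[-10.0 * vj for vj in momenta[0]] for _ in range(f)]
        x = byz_nsgdm_step(x, momenta + byz, gamma, f)
    return x
```

Calling `demo()` returns the final iterate of the toy run. Because the step has fixed length `gamma`, the iterate is expected to settle into a small neighborhood of `target` rather than converge exactly, which is why the paper's analysis relies on a step-size schedule instead of a constant step.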

Paper Structure (33 sections, 17 theorems, 107 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Contributions.
Assumptions and Problem Setup
Problem.
Assumptions.
Related Work
Byzantine-robust distributed optimization.
$(L_0,L_1)$-smoothness in optimization.
New Method: Byz-NSGDM
Convergence Analysis
Proof roadmap.
Experiments
Attack Strategies
Bit/Sign Flipping (BF).
Label Flipping (LF).
...and 18 more sections

Key Result

Theorem 1

Let Assumptions as:lower_bound through as:heterogeneity_detailed hold, and let the server use a $(\delta,\kappa)$-robust aggregator. Run Algorithm 1 (Byz-NSGDM) with … where … Define $\Delta^* := \tfrac{1}{G}\sum_{i\in\mathcal{G}}(f^*-f_i^*)$ and $V_0 := \tfrac{1}{G}\sum_{i\in\mathcal{G}}\mathbb{E}[\|v_i^0-\nabla f_i(x^0)\|]$. Then …

Figures (4)

Figure 1: Ablation study on heterogeneous MNIST under Bit Flipping attack with RFA aggregation. (a) Final test accuracy heatmap across momentum ($\beta$) and learning rate ($\gamma_0$). (b) Training curves for each momentum value using its best learning rate. (c) Final accuracy vs momentum for different learning rates.
Figure 2: Test accuracy curves on heterogeneous MNIST under Byzantine attacks. Each row corresponds to a different attack (BF, LF, Mimic), and each column to a different aggregator (RFA, Krum, CM). Lines show mean accuracy over 3 seeds with shaded regions indicating $\pm 1$ standard deviation.
Figure 3: Gradient norm evolution (log scale) for synthetic $(L_0,L_1)$-smooth optimization under Byzantine attacks. Each row shows a different attack (BF, Mimic, ALIE), and each column a different aggregator (RFA, Krum, CM). Lines show mean gradient norm over 3 seeds with shaded regions indicating $\pm 1$ standard deviation. Legend includes tuned learning rates.
Figure 4: Validation perplexity (log scale) on Shakespeare character-level language modeling under Byzantine attacks. Each row shows a different attack (BF, Mimic, ALIE), and each column a different aggregator (RFA, Krum, CM).

Theorems & Definitions (39)

Definition 1: Byzantine-robust aggregator
Remark 1
Theorem 1: Convergence of Byz-NSGDM
Remark 2: Fixed-horizon schedule
Corollary 1
Remark 3
Remark 4: Theory vs. practice step-schedules
Lemma 1: Lemma 1 from khiriat2024
Lemma 2: Lemma 2 from khiriat2024
Lemma 3
...and 29 more
