Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training

Tehila Dahan; Kfir Y. Levy

TL;DR

This paper introduces the Centered Trimmed Meta Aggregator (CTMA), an efficient meta-aggregator that upgrades baseline aggregators to optimal performance levels at low computational cost. It also proposes applying a recently developed gradient estimation technique, based on a double-momentum strategy, to the Byzantine setting.

Abstract

In this paper, we investigate the challenging framework of Byzantine-robust training in distributed machine learning (ML) systems, focusing on enhancing both efficiency and practicality. As distributed ML systems become integral for complex ML tasks, ensuring resilience against Byzantine failures, where workers may contribute incorrect updates due to malice or error, gains paramount importance. Our first contribution is the introduction of the Centered Trimmed Meta Aggregator (CTMA), an efficient meta-aggregator that upgrades baseline aggregators to optimal performance levels, while requiring low computational demands. Additionally, we propose harnessing a recently developed gradient estimation technique based on a double-momentum strategy within the Byzantine context. Our paper highlights its theoretical and practical advantages for Byzantine-robust training, especially in simplifying the tuning process and reducing the reliance on numerous hyperparameters. The effectiveness of this technique is supported by theoretical insights within the stochastic convex optimization (SCO) framework and corroborated by empirical evidence.

Paper Structure (38 sections, 12 theorems, 71 equations, 10 figures, 2 tables, 2 algorithms)

Introduction
Related Work.
Setting
Notation.
Assumptions.
Robust Aggregation and Meta-Aggregators
Centered Trimmed Meta Aggregator (CTMA)
Synchronous Robust Training
$\mu^2$-SGD
Synchronous Robust $\mu^2$-SGD
Experiments
CTMA Versus Existing Meta Aggregators.
$\mu^2$-SGD Versus Momentum.
Bounded Smoothness Variance Assumption
Synchronous $\mu^2$-SGD Analysis
...and 23 more sections

Key Result

Lemma 3.1

Under the assumptions outlined in Definition 3.1, if CTMA receives a $(c_\delta, \delta)$-robust aggregator $\mathcal{A}$, then the output of CTMA, $\hat{\mathbf{x}}$, is $(16\delta(1+c_\delta), \delta)$-robust.
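
This page does not reproduce the CTMA procedure itself, so the following is only a minimal sketch of what a centered trimmed meta-aggregation step might look like. It assumes that the meta-aggregator centers on the output of the base aggregator, discards the $\delta$ fraction of inputs farthest from that center, and averages the rest with equal weights. The function names, the equal weighting, and the coordinate-wise median base aggregator are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np


def coordinate_median(vectors):
    """Hypothetical base aggregator: coordinate-wise median of the inputs."""
    return np.median(vectors, axis=0)


def centered_trimmed_aggregate(vectors, delta, base_aggregator=coordinate_median):
    """Sketch of a centered trimmed mean (assumed form, not the paper's code).

    vectors: array-like of shape (m, d), one row per worker.
    delta:   assumed upper bound on the Byzantine fraction, 0 <= delta < 0.5.
    """
    vectors = np.asarray(vectors, dtype=float)
    if not 0 <= delta < 0.5:
        raise ValueError("delta must lie in [0, 0.5)")
    m = vectors.shape[0]

    # Center: whatever the base (robust) aggregator returns.
    center = base_aggregator(vectors)

    # Distance of every worker vector to the center.
    distances = np.linalg.norm(vectors - center, axis=1)

    # Keep the m - floor(delta * m) vectors closest to the center.
    keep = m - int(np.floor(delta * m))
    closest = np.argsort(distances)[:keep]

    # Plain average of the retained vectors.
    return vectors[closest].mean(axis=0)


# Three honest workers near (1, 1) and one outlier standing in for a Byzantine worker.
workers = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [100.0, -100.0]]
result = centered_trimmed_aggregate(workers, delta=0.25)
```

In this sketch the only work beyond the base aggregator call is one pass of distance computations and one sort, roughly $O(dm + m \log m)$ for $m$ workers in $d$ dimensions. That would be in line with the low computational demands claimed for CTMA, though the paper's own cost analysis should be consulted.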

Figures (10)

Figure 1: Performance comparison of CTMA with existing meta-aggregators (Conf. 1 in the configurations table).
Figure 2: Performance comparison of $\mu^2$-SGD with standard momentum.
Figure 3: Performance comparison of CTMA with existing meta-aggregators, and $\mu^2$-SGD with momentum under sign-flipping and label-flipping attacks with 8/17 Byzantine workers on the MNIST dataset (Conf. 1 in the configurations table). Here, we observe that CTMA enhances the performance of both $\mu^2$-SGD and momentum. Furthermore, CTMA performs at least as well as, if not better than, other meta-aggregators. The integration of NNM with CTMA can further improve performance. Notably, $\mu^2$-SGD demonstrates high stability compared to momentum, even without the assistance of any meta-aggregator. This stability is valuable for increasing resilience against heavy attacks and boosting overall performance.
Figure 4: Performance comparison of CTMA with existing meta-aggregators, and $\mu^2$-SGD with momentum under a weaker sign-flipping and label-flipping attack with 4/17 Byzantine workers on the MNIST dataset (Conf. 1 in the configurations table). In this scenario, even though the performance of momentum is noisier over iterations compared to $\mu^2$-SGD, it does not require additional stability to perform effectively and outperforms $\mu^2$-SGD. CTMA enhances the performance of both $\mu^2$-SGD and momentum, maintaining consistent improvement as observed under the heavier attacks shown in Figure 3.
Figure 5: Performance comparison of CTMA with existing meta-aggregators, and $\mu^2$-SGD with momentum under SOTA low-variance attacks, empire and little, with 8/17 Byzantine workers on the MNIST dataset (Conf. 1 in the configurations table). These low-variance attacks are harder to detect and represent an especially severe attack scenario. In this context, CTMA performs poorly due to its strong reliance on the variance among the workers' outputs. In contrast, NNM performs very effectively for both momentum and $\mu^2$-SGD. The low variance in $\mu^2$-SGD enhances the effectiveness of NNM, making $\mu^2$-SGD more robust and particularly valuable against heavy low-variance attacks compared to momentum, with or without the addition of NNM.
...and 5 more figures

Theorems & Definitions (27)

Definition 3.1
Lemma 3.1
proof
Remark 3.1
Theorem 4.1
Lemma 4.1
proof
Theorem 4.2: Synchronous Byzantine $\mu^2$-SGD
Remark 4.1
Remark 4.2
...and 17 more
