SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training

Hadi Mohaghegh Dolatabadi; Thalaiyasingam Ajanthan; Sameera Ramasinghe; Chamin P Hewa Koneputugodage; Gil Avraham; Yan Zuo; Violetta Shevchenko; Alexander Long

TL;DR

This work proposes SENTINEL, a verification mechanism for pipeline parallel (PP) training that requires no computation duplication. It verifies the sequential activation/gradient transmission between layers and comes with theoretical convergence guarantees for this new setting that recover classical convergence rates when relaxed to standard training.

Abstract

Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. While existing Byzantine-tolerant literature addresses data parallel (DP) training through robust aggregation methods, pipeline parallelism (PP) presents fundamentally distinct challenges. In PP, model layers are distributed across workers where the activations and their gradients flow between stages rather than being aggregated, making traditional DP approaches inapplicable. We propose SENTINEL, a verification mechanism for PP training without computation duplication. SENTINEL employs lightweight momentum-based monitoring using exponential moving averages (EMAs) to detect corrupted inter-stage communication. Unlike existing Byzantine-tolerant approaches for DP that aggregate parameter gradients across replicas, our approach verifies sequential activation/gradient transmission between layers. We provide theoretical convergence guarantees for this new setting that recover classical convergence rates when relaxed to standard training. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
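To make the EMA-based monitoring idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a verifier keeps one EMA per stage as a reference point, measures a relative Euclidean distance to it, and compares that distance against a fixed threshold. The class name `EmaMonitor`, the values of `beta` and `tau`, the choice of distance, and the rule that only accepted vectors update the EMA are all illustrative assumptions; the paper itself discusses the choice of distance measure $\Omega$ and automatic updates of $\tau$.

```python
import math


class EmaMonitor:
    """Hypothetical per-stage monitor: keeps an EMA reference and flags outliers."""

    def __init__(self, beta=0.9, tau=0.5):
        self.beta = beta  # EMA decay factor (assumed value)
        self.tau = tau    # fixed distance threshold (assumed; not the paper's automatic rule)
        self.ema = None   # reference point, initialised from the first vector seen

    @staticmethod
    def distance(x, ref):
        """Relative Euclidean distance, one possible choice of distance measure."""
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, ref)))
        norm = math.sqrt(sum(b * b for b in ref))
        return diff / (norm + 1e-12)

    def check(self, x):
        """Return True if x is accepted. Only accepted vectors update the EMA."""
        if self.ema is None:
            self.ema = list(x)
            return True
        if self.distance(x, self.ema) > self.tau:
            return False  # too far from the reference: treat as suspicious
        self.ema = [self.beta * m + (1 - self.beta) * a
                    for m, a in zip(self.ema, x)]
        return True


# Toy usage: three similar "activation" vectors, then a heavily corrupted one.
monitor = EmaMonitor(beta=0.9, tau=0.5)
honest = [[1.0, 2.0, 3.0], [1.1, 1.9, 3.0], [0.9, 2.1, 3.1]]
accepted = [monitor.check(v) for v in honest]
corrupted_ok = monitor.check([-10.0, 20.0, -30.0])
```

In this toy run the three similar vectors stay within the threshold and the corrupted one is rejected. Skipping the EMA update on rejection is a design choice of the sketch, meant to keep a flagged transmission from dragging the reference point toward itself.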

Paper Structure (95 sections, 9 theorems, 72 equations, 22 figures, 20 tables, 7 algorithms)

Introduction
Problem Statement
Threat Model
Attack Variants.
Momentum-based Verification of Workers
Proposed Method
Exponential Moving Average as Reference Point.
Choice of Distance Measure $\Omega$.
Automatic Threshold $\tau$ Updates.
Handling Cascading Effects.
Theoretical Analysis
Related Work
Experimental Results
Metrics.
Verification Performance.
...and 80 more sections

Key Result

Theorem 3.1

Consider a distributed training setup that utilizes PP to split the model layers across workers and uses momentum-based verification to verify each worker's contribution in the forward or backward pass. We do not consider dishonest activity during the "all-reduce" operation for syncing parameter grad

Figures (22)

Figure 1: Scatter plot of F1-scores vs. validation loss for more than 75 different attack setups. Strong attacks (higher loss) are caught more often (high F1-score), while weak attacks may slip through. Our verification method thus catches the most harmful attacks that would disrupt training.
Figure 2: Distributed threat models. (a) In DP, workers hold full model replicas and only send weight gradients. Traditional Byzantine-tolerant methods consider this case and use robust aggregation. (b & c) The threat model considered in this paper (see the Threat Model section). In PP, workers hold individual layers and send intermediate activations $\boldsymbol{h}$ and activation gradients $\boldsymbol{g}$, so corruptions directly affect other workers. (b) In the standard setting, there is a fixed grid of pipeline stages and data parallel replicas, and communication is routed through our verifier nodes. (c) In the SWARM setting designed for decentralized setups, data is stochastically routed by trainer nodes. Workers send their computations ($\boldsymbol{h}$ and $\boldsymbol{g}$) to trainers, who then route them to an available worker in the next stage. Our proposed verifiers can seamlessly be added on top of these trainer nodes (see the SWARM appendix).
Figure 3: Ablation studies on the effect of various elements on the verification performance.
Figure 4: Loss when training Llama-3-0.6B models using SWARM (Ryabinin et al., 2023) with 128 distributed workers on preemptible AWS instances. Workers employ various activation/gradient manipulation attacks to disrupt training. Without verification, training is disrupted; with SENTINEL, training is protected from divergence.
Figure 5: Verification protocol for handling compromised workers during distributed training. During forward propagation, a worker at stage $s+1$ is detected as potentially compromised (shown in yellow). The verifier nodes continue forwarding activations to subsequent stages without alerting downstream workers to avoid disrupting the pipeline. During backward propagation, instead of propagating gradients computed by the compromised worker, verifier nodes substitute gradient momentum values to maintain training stability. Communication flows through verifier nodes between consecutive pipeline stages, though direct worker-to-worker arrows are shown for visual clarity.
...and 17 more figures
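
The backward-pass handling described in the Figure 5 caption can be sketched as follows. This is a hedged illustration only: the function name `relay_gradient`, the `flagged` argument, and the EMA update with decay `beta` are assumptions introduced here, not the paper's interface. The sketch shows a verifier forwarding its stored gradient momentum in place of a gradient from a worker that was flagged during the forward pass.

```python
def relay_gradient(grad, grad_momentum, flagged, beta=0.9):
    """Return (gradient to send upstream, updated momentum).

    Hypothetical verifier-side helper: when the producing worker is flagged,
    its gradient is discarded and the stored momentum is forwarded instead.
    """
    if flagged:
        # Substitute the running momentum and leave it unchanged.
        return list(grad_momentum), list(grad_momentum)
    # Honest case: forward the gradient and fold it into the momentum (EMA).
    updated = [beta * m + (1 - beta) * g for m, g in zip(grad_momentum, grad)]
    return list(grad), updated


# Toy usage: a flagged worker's large gradient is replaced by the momentum.
sent, momentum = relay_gradient([100.0, -100.0], [0.5, 0.5], flagged=True)
```

Here `sent` equals the stored momentum, so the upstream stage never sees the suspect gradient.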

Theorems & Definitions (17)

Definition 2.1: Data and Pipeline Parallel Threat Model
Theorem 3.1: Convergence Under Bounded Perturbations (informal)
Lemma 3.2: Honest Majority Guarantee
Lemma E.1: Momentum Smoothing Bounds the Global Deviation
proof
Lemma E.2: Test Statistic Deviation
proof
Lemma E.3: Per-stage Lipschitz constants
proof
Lemma E.4: Sensitivity of Parameter Gradient to Activation Perturbation
...and 7 more
