SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training

Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P Hewa Koneputugodage, Gil Avraham, Yan Zuo, Violetta Shevchenko, Alexander Long

TL;DR

This work proposes SENTINEL, a verification mechanism for pipeline parallel (PP) training that requires no computation duplication and verifies sequential activation/gradient transmission between layers. It also provides theoretical convergence guarantees for this new setting that recover classical convergence rates when relaxed to standard training.

Abstract

Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. While existing Byzantine-tolerant literature addresses data parallel (DP) training through robust aggregation methods, pipeline parallelism (PP) presents fundamentally distinct challenges. In PP, model layers are distributed across workers, and activations and their gradients flow between stages rather than being aggregated, making traditional DP approaches inapplicable. We propose SENTINEL, a verification mechanism for PP training without computation duplication. SENTINEL employs lightweight momentum-based monitoring using exponential moving averages (EMAs) to detect corrupted inter-stage communication. Unlike existing Byzantine-tolerant approaches for DP that aggregate parameter gradients across replicas, our approach verifies sequential activation/gradient transmission between layers. We provide theoretical convergence guarantees for this new setting that recover classical convergence rates when relaxed to standard training. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
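
As a rough illustration of the EMA-based monitoring idea mentioned in the abstract, the following minimal sketch keeps a running average of the tensors a stage sends and flags a transmission that deviates too far from it. The class name `EmaMonitor`, the relative L2 distance, and the values of `beta` and `threshold` are assumptions made for this example; the page does not specify SENTINEL's actual test statistic or hyperparameters.

```python
import numpy as np


class EmaMonitor:
    """Hypothetical monitor: tracks an EMA of a stage's outputs and flags outliers."""

    def __init__(self, beta=0.9, threshold=0.5):
        self.beta = beta            # EMA decay (assumed value, not from the paper)
        self.threshold = threshold  # relative-distance cutoff (assumed value)
        self.ema = None

    def check(self, x):
        """Return True if x is consistent with the running average, else False."""
        x = np.asarray(x, dtype=np.float64)
        if self.ema is None:
            # First observation: nothing to compare against, so accept it.
            self.ema = x.copy()
            return True
        # Relative L2 distance to the EMA (an assumed choice of statistic).
        dist = np.linalg.norm(x - self.ema) / (np.linalg.norm(self.ema) + 1e-12)
        ok = bool(dist <= self.threshold)
        if ok:
            # Only accepted transmissions update the running average.
            self.ema = self.beta * self.ema + (1.0 - self.beta) * x
        return ok
```

In this sketch a rejected tensor does not update the average, so a single corrupted transmission cannot drag the reference toward itself; whether SENTINEL makes the same choice is not stated here.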

Paper Structure (95 sections, 9 theorems, 72 equations, 22 figures, 20 tables, 7 algorithms)

Key Result

Theorem 3.1

Consider a distributed training setup that utilizes PP to split the model layers across workers and uses momentum-based verification to verify each worker's contribution in the forward or backward pass. We do not consider dishonest activity during the "all-reduce" operation for syncing parameter grad

Figures (22)

  • Figure 1: Scatter plot of F1-scores vs. validation loss for more than 75 different attack setups. Strong attacks (higher loss) are caught more often (high F1-score), while weak attacks may slip through. Our verification method thus catches the most harmful attacks that would disrupt training.
  • Figure 2: Distributed threat models. (a) In DP, workers hold full model replicas and only send weight gradients. Traditional Byzantine-tolerant methods consider this case and use robust aggregation. (b & c) The threat model considered in this paper (see \ref{sec:problem_statement:threat_model}). In PP, workers hold individual layers and send intermediate activations $\boldsymbol{h}$ and activation gradients $\boldsymbol{g}$, so corruptions directly affect other workers. (b) In the standard setting, there is a fixed grid of pipeline stages and data parallel replicas, and communication is routed through our verifier nodes. (c) In the SWARM setting, designed for decentralized setups, data is stochastically routed by trainer nodes. Workers send their computations ($\boldsymbol{h}$ and $\boldsymbol{g}$) to trainers, who then route them to an available worker in the next stage. Our proposed verifiers can seamlessly be added on top of these trainer nodes (\ref{sec:appendix:swarm}).
  • Figure 3: Ablation studies on the effect of various elements on the verification performance.
  • Figure 4: Loss when training Llama-3-0.6B models using SWARM (Ryabinin et al., 2023) with 128 distributed workers on preemptible AWS instances. Workers employ various activation/gradient manipulation attacks to disrupt training. Without verification, training is disrupted, whereas SENTINEL successfully protects training from divergence.
  • Figure 5: Verification protocol for handling compromised workers during distributed training. During forward propagation, a worker at stage $s+1$ is detected as potentially compromised (shown in yellow). The verifier nodes continue forwarding activations to subsequent stages without alerting downstream workers to avoid disrupting the pipeline. During backward propagation, instead of propagating gradients computed by the compromised worker, verifier nodes substitute gradient momentum values to maintain training stability. Communication flows through verifier nodes between consecutive pipeline stages, though direct worker-to-worker arrows are shown for visual clarity.
  • ...and 17 more figures
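
The backward-pass behaviour described in the Figure 5 caption, where a verifier substitutes gradient momentum for the gradient of a flagged worker, could be sketched roughly as follows. The function name `backward_relay`, the decay `beta`, and the choice to leave the momentum unchanged when a worker is flagged are assumptions for illustration only, not details confirmed by this page.

```python
import numpy as np


def backward_relay(grad, grad_momentum, flagged, beta=0.9):
    """Hypothetical verifier step for the backward pass.

    Returns (gradient to pass to the previous stage, updated momentum).
    """
    grad = np.asarray(grad, dtype=np.float64)
    if flagged:
        # Worker was flagged: send the momentum instead of its gradient,
        # and do not let the suspect gradient update the momentum (assumed).
        return grad_momentum.copy(), grad_momentum
    # Worker looks honest: pass its gradient on and update the EMA.
    new_momentum = beta * grad_momentum + (1.0 - beta) * grad
    return grad, new_momentum
```

For example, with a momentum of all ones, a flagged gradient of any magnitude is replaced by the all-ones vector, while an accepted gradient of all twos is passed through and moves the momentum to 1.1 per entry when `beta=0.9`.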

Theorems & Definitions (17)

  • Definition 2.1: Data and Pipeline Parallel Threat Model
  • Theorem 3.1: Convergence Under Bounded Perturbations (informal)
  • Lemma 3.2: Honest Majority Guarantee
  • Lemma E.1: Momentum Smoothing Bounds the Global Deviation
  • proof
  • Lemma E.2: Test Statistic Deviation
  • proof
  • Lemma E.3: Per-stage Lipschitz constants
  • proof
  • Lemma E.4: Sensitivity of Parameter Gradient to Activation Perturbation
  • ...and 7 more