Robust Collaborative Inference with Vertically Split Data Over Dynamic Device Environments

Surojit Ganguli; Zeyu Zhou; Christopher G. Brinton; David I. Inouye

TL;DR

The paper tackles robust inference for vertically split data across dynamic edge networks, introducing the Dynamic Network VFL (DN-VFL) problem setting and the MAGS framework. MAGS combines (i) fault simulation during training via dropout (including communication dropout), (ii) replication of aggregators through MACL (and the low-cost 4-MACL variant), and (iii) gossip-based ensembling to reduce prediction variance at test time. The theoretical analysis shows that the dynamic risk under faults is lower bounded by a term that scales with the number of aggregators, and that gossiping lowers ensemble risk through diversity, with variance decaying as a function of the number of gossip rounds and the graph spectral radius. Empirically, MAGS remains robust at high fault rates (up to 50%) on several datasets, often surpassing baselines by more than 20 percentage points, which supports the practical viability of decentralized, fault-tolerant vertically split learning for safety-critical edge environments. This work establishes DN-VFL as a foundation for robust, privacy-conscious collaboration in dynamic networks and points to future extensions in asynchronous communication and privacy-preserving variants.
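To make the gossip-based ensembling step concrete, here is a minimal Python sketch of averaging per-device log probabilities over a few gossip rounds. It is an illustration under stated assumptions, not the paper's exact protocol: the function name `gossip_average`, the uniform neighbor weighting, and the 0/1 `adjacency` matrix describing which links survive are all hypothetical choices made for this example.

```python
import numpy as np

def gossip_average(log_probs, adjacency, rounds):
    """Average per-device log probabilities with their neighbors for `rounds` rounds.

    log_probs: (n_devices, n_classes) array, one row per aggregator.
    adjacency: (n_devices, n_devices) 0/1 array of surviving links (assumed input).
    """
    a = np.asarray(adjacency, dtype=float).copy()
    np.fill_diagonal(a, 1.0)                 # each device keeps its own estimate
    mix = a / a.sum(axis=1, keepdims=True)   # uniform weights over self and neighbors
    out = np.asarray(log_probs, dtype=float)
    for _ in range(rounds):
        out = mix @ out                      # one gossip round
    return out

# Example: three devices on a path graph D1 - D2 - D3.
log_probs = np.array([[0.0, -1.0], [-2.0, -1.0], [-4.0, -1.0]])
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
after_one = gossip_average(log_probs, path, rounds=1)
after_three = gossip_average(log_probs, path, rounds=3)
```

In this toy setup the disagreement between devices shrinks as the number of rounds grows, which is the qualitative behavior the variance statement above describes. How fast it shrinks depends on the connectivity of the graph.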

Abstract

When each edge device of a network only perceives a local part of the environment, collaborative inference across multiple devices is often needed to predict global properties of the environment. In safety-critical applications, collaborative inference must be robust to significant network failures caused by environmental disruptions or extreme weather. Existing collaborative learning approaches, such as privacy-focused Vertical Federated Learning (VFL), typically assume a centralized setup or that one device never fails. However, these assumptions make prior approaches susceptible to significant network failures. To address this problem, we first formalize the problem of robust collaborative inference over a dynamic network of devices that could experience significant network faults. Then, we develop a minimalistic yet impactful method called Multiple Aggregation with Gossip Rounds and Simulated Faults (MAGS) that synthesizes simulated faults via dropout, replication, and gossiping to significantly improve robustness over baselines. We also theoretically analyze our proposed approach to explain why each component enhances robustness. Extensive empirical results validate that MAGS is robust across a range of fault rates, including extreme fault rates.

Paper Structure (53 sections, 4 theorems, 10 equations, 21 figures, 8 tables, 1 algorithm)

Introduction
Related works
Network Dynamic Resilient Federated Learning (FL)
Decentralized FL
Problem Formulation
Notation
Dynamic Network VFL Context
Data and Network Context
DN-VFL Problem Formulation via Dynamic Risk
Multiple Aggregation with Gossip Rounds and Simulated Faults (MAGS)
Decentralized Training of MAGS with Real and Simulated Faults via Dropout
Multiple Aggregators CL (MACL)
Gossip Layers to Ensemble Aggregator Predictions
Experiments
Datasets
...and 38 more sections

Key Result

Proposition 1

Given a device fault rate $r$, the number of data aggregators $K \leq C$, and the post-processing function $h_\mathrm{active}$, and assuming the risk of a predictor (data aggregator) with faults is higher than that without faults, then the dynamic risk with faults is lower bounded by:

Figures (21)

Figure 1: Collaborative Learning (CL) (\ref{subfig:VFL-context}) assumes samples are split across clients with a central server. The data context in our study is the same as VFL, where the features are split across clients. However, in our case, no centralized server node is assumed, and clients serve as data aggregators (\ref{subfig:iDecentralized-context}). Our goal is to obtain robust test-time performance even under highly dynamic networks with client/device faults (①), server faults (②), and communication faults (③).
Figure 2: Test accuracy with and without the communication-wise (CD-) and party-wise (PD-) dropout methods for StarCraftMNIST with 16 devices. Here we include models trained under a dropout rate of 30% (marked by 'PD-' or 'CD-'). All results are averaged over 16 runs, and the error bars represent standard deviation. Across different configurations, MAGS (PD/CD-MACL-G4) trained with feature omissions has the highest average performance, while vanilla VFL is not robust as the fault rate increases. As our experiments are repeated multiple times, what we report is the expectation (Avg) over the random active client selection.
Figure 3: Illustration of communication and device faults for a 3-device network for the MVFL method. (a) Fully connected MVFL setup. The check mark indicates that there is no fault in the final communication between a device and the special node (SN), as defined in Section \ref{sec:distributed-inference} of the main paper. (b) Representation with communication faults. In this example, communication from D1 to D2 and from D2 to D3 is faulted. To account for the missing values, we use zero imputation. The $X$ indicates that the communication between D3 and SN is faulted. Hence, the output at SN will be a class selected with uniform probability among all the classes. (c) With device faults, the faulted device does not communicate with any other device, and missing values are accounted for by zero imputation. In this example, D2 is assumed to be faulted, hence the information from D2 is not passed to D1 or D3 and D2 does not produce an output. The output at SN for D2 will be a class selected with uniform probability among all the classes.
Figure 4: Illustration of party-wise and communication-wise dropout for a 3-device network for the MVFL method. (a) Fully connected MVFL setup. (b) For party-wise dropout (PD), if D3 is dropped during training, then none of the devices receives representations from D3 and the missing values are imputed by zeros. (c) In communication-wise dropout (CD), certain representations are omitted during training. In this example, representations from D2 to D1 and from D1 to D3 are omitted by design during training.
Figure 5: VFL as a baseline and the proposed innovations, illustrated for a network of two fully connected devices, D1 and D2. (a) VFL setup with D1 acting as a client as well as the aggregating server. The inputs to the devices at the first layer $L_1$ are $x_1$ and $x_2$, and the outputs are the latent representations. The input to the server at the second layer $L_2$ is the concatenated latent representation, and the output is the prediction $y_1$. (b) The MVFL arrangement has both devices acting as servers in addition to being clients. (c) DMVFL has a similar arrangement to MVFL, except that there is an additional layer of processing, $L_3$, which takes the concatenated features from the previous layer as input and outputs the predictions. (d) MVFL-G is an extension of MVFL wherein the output log probabilities ($Y_{i,L}$) from each device are averaged before being used for the final prediction.
...and 16 more figures
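
The party-wise and communication-wise dropout schemes described in the Figure 4 caption can be sketched as masks over a device-to-device matrix, with zero imputation applied to dropped representations. This is a hedged illustration only: the function names, the `(n_devices, d)` representation layout, and the choice to let a device always keep its own representation under communication dropout are assumptions for this example and are not taken from the paper.

```python
import numpy as np

def party_dropout_mask(n_devices, rate, rng):
    """mask[i, j] = 1 if aggregator i receives device j's representation.

    A dropped device is hidden from every aggregator (one draw per device).
    """
    alive = rng.random(n_devices) >= rate
    return np.tile(alive, (n_devices, 1)).astype(float)

def communication_dropout_mask(n_devices, rate, rng):
    """Drop individual links independently (one draw per sender/receiver pair)."""
    mask = (rng.random((n_devices, n_devices)) >= rate).astype(float)
    np.fill_diagonal(mask, 1.0)  # assumption: a device always sees its own representation
    return mask

def aggregate_inputs(reps, mask):
    """Build each aggregator's concatenated input, imputing zeros for dropped entries.

    reps: (n_devices, d) latent representations, one row per device.
    Returns an (n_devices, n_devices * d) array, one row per aggregator.
    """
    n, d = reps.shape
    masked = mask[:, :, None] * reps[None, :, :]  # entry [i, j] is device j's rep as seen by i
    return masked.reshape(n, n * d)

# Example: 3 devices, 2-dimensional representations, 30% dropout rate.
rng = np.random.default_rng(0)
reps = rng.normal(size=(3, 2))
pd_inputs = aggregate_inputs(reps, party_dropout_mask(3, 0.3, rng))
cd_inputs = aggregate_inputs(reps, communication_dropout_mask(3, 0.3, rng))
```

The only difference between the two schemes in this sketch is the granularity of the random draw: per device for party-wise dropout, per link for communication-wise dropout.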

Theorems & Definitions (11)

Definition 1: Dynamic Network Context
Definition 2: Device Fault Dynamic Network
Definition 3: Communication Fault Dynamic Network
Definition 4: Dynamic Risk
Proposition 1
Proposition 2
Proposition 3
Lemma 4: Conditional Client Selection Probability
Proof of \ref{thm:conditional-selection-prob}
Proof of \ref{prop:k-MVFL}
...and 1 more
