Robust Collaborative Inference with Vertically Split Data Over Dynamic Device Environments

Surojit Ganguli, Zeyu Zhou, Christopher G. Brinton, David I. Inouye

TL;DR

The paper tackles robust inference for vertically split data across dynamic edge networks, introducing Dynamic Network VFL (DN-VFL) and the MAGS framework. MAGS combines (i) fault simulation during training via dropout (including communication dropout), (ii) replication of aggregators through MACL (and the low-cost 4-MACL variant), and (iii) gossip-based ensembling to reduce prediction variance at test time. A key theoretical insight shows that the fault-tolerant risk under dynamic conditions is bounded by a term that scales with the number of aggregators, and that gossiping lowers ensemble risk through diversity, with variance decaying as a function of the gossip rounds and graph spectral radius. Empirically, MAGS delivers strong robustness across high fault rates (up to 50%) on several datasets, often surpassing baselines by more than 20 percentage points and highlighting the practical viability of decentralized, fault-tolerant vertically split learning for safety-critical edge environments. This work establishes DN-VFL as a foundation for robust, privacy-conscious collaboration in dynamic networks and points to future extensions in asynchronous communication and privacy-preserving variants.
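The gossip-based ensembling step can be illustrated with a minimal sketch. It is not the authors' implementation: the 4-device ring topology, the toy logits, and the helper names (`softmax_log`, `gossip_round`, `spread`) are all assumptions made for illustration. Each device starts from its own log-probabilities and repeatedly averages them with its neighbors', so disagreement between devices shrinks as the number of gossip rounds grows.

```python
import math

def softmax_log(logits):
    """Return log-probabilities for one device's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def gossip_round(values, neighbors):
    """One synchronous gossip round: each device averages its vector with its neighbors'."""
    new_values = []
    for i, own in enumerate(values):
        group = [own] + [values[j] for j in neighbors[i]]
        new_values.append([sum(col) / len(group) for col in zip(*group)])
    return new_values

def spread(values):
    """Largest per-class gap between any two devices (0 means full agreement)."""
    return max(max(col) - min(col) for col in zip(*values))

# Hypothetical 4-device ring; each device holds log-probabilities over 3 classes.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
toy_logits = ([2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [1.5, 1.4, 0.0], [2.2, 0.1, 0.4])
log_probs = [softmax_log(logits) for logits in toy_logits]

history = [spread(log_probs)]
for _ in range(3):  # three gossip rounds
    log_probs = gossip_round(log_probs, neighbors)
    history.append(spread(log_probs))

# Each device predicts from its own (now smoothed) log-probabilities.
predictions = [max(range(3), key=lambda c: lp[c]) for lp in log_probs]
print(history, predictions)
```

Under faults, a device would simply average over whichever neighbors it can still reach, which in this sketch amounts to shortening the corresponding entries of `neighbors`.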

Abstract

When each edge device of a network only perceives a local part of the environment, collaborative inference across multiple devices is often needed to predict global properties of the environment. In safety-critical applications, collaborative inference must be robust to significant network failures caused by environmental disruptions or extreme weather. Existing collaborative learning approaches, such as privacy-focused Vertical Federated Learning (VFL), typically assume a centralized setup or that one device never fails. However, these assumptions make prior approaches susceptible to significant network failures. To address this problem, we first formalize the problem of robust collaborative inference over a dynamic network of devices that could experience significant network faults. Then, we develop a minimalistic yet impactful method called Multiple Aggregation with Gossip Rounds and Simulated Faults (MAGS) that synthesizes simulated faults via dropout, replication, and gossiping to significantly improve robustness over baselines. We also theoretically analyze our proposed approach to explain why each component enhances robustness. Extensive empirical results validate that MAGS is robust across a range of fault rates, including extreme fault rates.
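To see why relying on a single aggregator that "never fails" is fragile, and why replication helps, consider a back-of-the-envelope sketch. It assumes that each aggregator fails independently with the same fault rate; this independence assumption and the function name are illustrative only, and the calculation is not the bound proved in the paper.

```python
def prob_any_aggregator_alive(fault_rate, num_aggregators):
    """P(at least one of K aggregators is up), assuming independent faults."""
    return 1.0 - fault_rate ** num_aggregators

# At a 50% fault rate, compare one aggregator against replicated aggregators.
for k in (1, 4, 16):
    print(k, prob_any_aggregator_alive(0.5, k))
```

With one aggregator, a prediction is available only half of the time at a 50% fault rate; with four independent aggregators, at least one survives with probability 0.9375.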

Paper Structure

This paper contains 53 sections, 4 theorems, 10 equations, 21 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

Given a device fault rate $r$, the number of data aggregators $K\leq C$, and the post-processing function $h_{\mathrm{active}}$, and assuming the risk of a predictor (data aggregator) with faults is higher than that without faults, then the dynamic risk with faults is lower bounded by:

Figures (21)

  • Figure 1: Collaborative Learning (CL) assumes samples are split across clients with a central server. The data context in our study is the same as VFL, where the features are split across clients. However, in our case, no centralized server node is assumed, and clients serve as data aggregators. Our goal is to obtain robust test-time performance even under highly dynamic networks with client/device faults (①), server faults (②), and communication faults (③).
  • Figure 2: Test accuracy with and without the communication-wise (CD-) and party-wise (PD-) Dropout methods for StarCraftMNIST with 16 devices. Here we include models trained under a dropout rate of 30% (marked by 'PD-' or 'CD-'). All results are averaged over 16 runs, and the error bars represent standard deviation. Across different configurations, MAGS (PD/CD-MACL-G4) trained with feature omissions has the highest average performance, while vanilla VFL is not robust as the fault rate increases. Because our experiments are repeated multiple times, what we report is the expectation (Avg) over the random active client selection.
  • Figure 3: Illustration of communication and device faults for a 3-device network for the MVFL method. (a) Fully connected MVFL setup. The check mark indicates that there is no fault in the final communication between a device and the special node (SN) as defined in the main paper. (b) Representation with communication faults. In this example, communication from D1 to D2 and from D2 to D3 is faulted. To account for the missing values, we use zero imputation. $X$ indicates that the communication between D3 and SN is faulted. Hence, the output at SN will be a class selected with uniform probability among all the classes. (c) With device faults, the faulted device does not communicate with any other device, and missing values are accounted for by zero imputation. In this example, D2 is assumed to be faulted; hence the information from D2 is not passed to D1 or D3, and D2 does not produce an output. The output at SN for D2 will be a class selected with uniform probability among all the classes.
  • Figure 4: Illustration of party-wise and communication-wise dropout for a 3-device network for the MVFL method. (a) Fully connected MVFL setup. (b) For party-wise Dropout (PD), if D3 is dropped during training, then none of the devices gets representations from D3, and the missing values are imputed by zeros. (c) In communication-wise Dropout (CD), certain representations are omitted during training. In this example, representations from D2 to D1 and from D1 to D3 are omitted by design during training.
  • Figure 5: VFL as a baseline and the proposed innovations are illustrated for a network of two fully connected devices, D1 and D2. (a) VFL setup with D1 acting as a client as well as the aggregating server. The inputs to the devices at the first layer $L_1$ are $x_1$ and $x_2$, and the outputs are the latent representations. The input to the server on the second layer $L_2$ is the concatenated latent representation, and the output is the prediction $y_1$. (b) The MVFL arrangement has both devices acting as servers in addition to being clients. (c) DMVFL has a similar arrangement to MVFL, except that there is an additional layer of processing, $L_3$, that takes the concatenated features from the previous layer as input and outputs the predictions. (d) MVFL-G is an extension of MVFL wherein the output log probabilities ($Y_{i,L}$) from each device are averaged before being used for the final prediction.
  • ...and 16 more figures
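
The dropout patterns described in the captions of Figures 3 and 4 can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function names, the toy latent vectors, and the convention that a device always keeps its own representation are assumptions. Party-wise dropout removes every outgoing message of a dropped device, communication-wise dropout removes individual directed links, and missing representations are zero-imputed before concatenation.

```python
import random

def party_dropout_mask(num_devices, rate, rng):
    """Party-wise dropout (PD): a dropped device sends nothing to any other device."""
    alive = [rng.random() >= rate for _ in range(num_devices)]
    # mask[src][dst] is True when dst receives src's representation.
    # Assumption: a device always keeps its own representation (src == dst).
    return [[src == dst or alive[src] for dst in range(num_devices)]
            for src in range(num_devices)]

def communication_dropout_mask(num_devices, rate, rng):
    """Communication-wise dropout (CD): each directed link is dropped independently."""
    return [[src == dst or rng.random() >= rate for dst in range(num_devices)]
            for src in range(num_devices)]

def aggregate_inputs(representations, mask, dst):
    """Concatenate what device dst receives, zero-imputing missing representations."""
    received = []
    for src, rep in enumerate(representations):
        received.extend(rep if mask[src][dst] else [0.0] * len(rep))
    return received

rng = random.Random(0)
reps = [[0.2, 0.7], [1.1, -0.4], [0.5, 0.9]]  # toy latent vectors for D1, D2, D3
pd_mask = party_dropout_mask(3, 0.3, rng)
cd_mask = communication_dropout_mask(3, 0.3, rng)
inputs_d1 = aggregate_inputs(reps, cd_mask, dst=0)

# Fixed example: the link D2 -> D1 is dropped, so D2's slot is zero-imputed at D1.
fixed_mask = [[True, True, True], [False, True, True], [True, True, True]]
print(aggregate_inputs(reps, fixed_mask, dst=0))
```

The same masks can stand in for test-time faults: a device fault corresponds to a party-wise mask, and a communication fault to a link-wise one.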

Theorems & Definitions (11)

  • Definition 1: Dynamic Network Context
  • Definition 2: Device Fault Dynamic Network
  • Definition 3: Communication Fault Dynamic Network
  • Definition 4: Dynamic Risk
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 4: Conditional Client Selection Probability
  • Proof of Lemma 4 (Conditional Client Selection Probability)
  • Proof of the proposition labeled prop:k-MVFL
  • ...and 1 more