Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference
Conor Hassan, Matthew Sutton, Antonietta Mira, Kerrie Mengersen
TL;DR
This paper proposes the first Bayesian framework for vertical federated learning (VFL). It uses asymptotically-exact data augmentation (AXDA) to introduce auxiliary variables that make the clients conditionally independent, which enables decentralized posterior inference. It develops two AXDA-based models for VFL, the augmented-variable model and the power-likelihood model, and pairs them with factorized amortized or mean-field variational approximations so that inference remains scalable even though the augmentation grows with the number of observations and clients. In experiments on logistic regression, Poisson multilevel regression, and a hierarchical Bayesian split neural net, the framework achieves competitive inference with privacy-friendly, decentralized updates, and the power-likelihood formulation often yields higher ELBOs than the augmented-variable approach. The results point to the potential of privacy-preserving, decentralized Bayesian inference on vertically partitioned data and provide a foundation for future work on asynchronous updates, model selection mechanisms, and scalable Bayesian VFL deployments.
Abstract
Vertical federated learning (VFL) has emerged as a paradigm for collaborative model estimation across multiple clients, each holding a distinct set of covariates. This paper introduces the first comprehensive framework for fitting Bayesian models in the VFL setting. We propose a novel approach that leverages data augmentation techniques to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms. We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods. To mitigate the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, we develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. We showcase the efficacy of our framework through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in various domains.
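To make the vertically partitioned setting concrete, the following minimal sketch shows a logistic regression in which two clients hold different covariates for the same observations. It illustrates only the data layout and why the clients are coupled through a shared linear predictor; it is not the AXDA-based method of the paper. The client names, dimensions, and coefficient values are hypothetical and chosen purely for illustration.

```python
import math
import random


def partial_predictor(rows, coefs):
    """One client's contribution: dot product of its own covariates and coefficients."""
    return [sum(x * b for x, b in zip(row, coefs)) for row in rows]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


random.seed(0)
n_obs = 5

# Hypothetical split: client A holds 2 covariates, client B holds 3,
# both for the same n_obs observations (rows are aligned across clients).
X_a = [[random.gauss(0, 1) for _ in range(2)] for _ in range(n_obs)]
X_b = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n_obs)]
beta_a = [0.5, -1.0]        # illustrative coefficients for client A
beta_b = [0.25, 0.0, 2.0]   # illustrative coefficients for client B

# Each client computes its contribution locally from its own data only.
eta_a = partial_predictor(X_a, beta_a)
eta_b = partial_predictor(X_b, beta_b)

# Only the per-observation contributions are combined to form probabilities.
probs_vfl = [sigmoid(a + b) for a, b in zip(eta_a, eta_b)]

# Reference: the same model evaluated on pooled (centralized) covariates.
X_full = [ra + rb for ra, rb in zip(X_a, X_b)]
probs_central = [sigmoid(e) for e in partial_predictor(X_full, beta_a + beta_b)]
```

Because the success probability depends on the sum of the clients' contributions, the likelihood of each observation involves every client's parameters at once. This coupling is what the auxiliary variables described in the TL;DR are meant to break.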
