VertiBayes: Learning Bayesian network parameters from vertically partitioned data with missing values

Florian van Daalen, Lianne Ippel, Andre Dekker, Inigo Bermejo

TL;DR

VertiBayes learns both the structure and the parameters of a Bayesian network (BN) from vertically partitioned data with missing values in a privacy-preserving federated setting. For structure learning, it adapts the K2 algorithm using a privacy-preserving scalar product protocol. For parameter learning, it handles missing data by first training an intermediate model that treats missing values as a special value, then generating synthetic data from that model, and finally running EM on the synthetic data. In the reported experiments, BN models learned under vertical partitioning perform comparably to centrally trained models. The privacy guarantees are equivalent to those of the scalar product protocol used, and two privacy-preserving validation strategies, SCV and SVDG, are proposed for estimating model performance. The approach supports an arbitrary number of parties, with runtime dominated by CPD size and protocol overhead. The paper also discusses practical considerations around discretization, validation, and information leakage when publishing BN components.
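To make the role of the scalar product concrete, the following is a minimal plaintext sketch, not the authors' implementation. It assumes each party can encode a local condition as a 0/1 indicator vector over the shared record order; the number of records satisfying all conditions is then the sum of the elementwise product of those vectors. The names `indicator` and `joint_count` and the toy attributes are hypothetical, and no privacy protection is applied here. A privacy-preserving protocol would compute the same number without any party revealing its vector.

```python
from math import prod

def indicator(column, value):
    """Return a 0/1 vector marking the rows where the local attribute equals `value`."""
    return [1 if v == value else 0 for v in column]

def joint_count(vectors):
    """Count rows where every indicator is 1 (sum of the elementwise product)."""
    return sum(prod(bits) for bits in zip(*vectors))

# Hypothetical toy data: two parties hold different attributes of the same four records.
party_a_smoker = ["yes", "no", "yes", "yes"]
party_b_cancer = ["yes", "no", "no", "yes"]

n_parent = joint_count([indicator(party_a_smoker, "yes")])
n_joint = joint_count([indicator(party_a_smoker, "yes"),
                       indicator(party_b_cancer, "yes")])

# Maximum-likelihood estimate of P(cancer = yes | smoker = yes) from the two counts.
p_cancer_given_smoker = n_joint / n_parent
```

Counts of this kind are what both a K2-style score and a CPD estimate need, which is why a single secure counting primitive can serve structure learning and parameter learning alike.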

Abstract

Federated learning makes it possible to train a machine learning model on decentralized data. Bayesian networks are probabilistic graphical models that have been widely used in artificial intelligence applications. Their popularity stems from the fact that they can be built by combining existing expert knowledge with data and are highly interpretable, which makes them useful for decision support, e.g. in healthcare. While some research has been published on the federated learning of Bayesian networks, publications on Bayesian networks in a vertically partitioned or heterogeneous data setting (where different variables are located in different datasets) are limited and suffer from important omissions, such as the handling of missing data. In this article, we propose a novel method called VertiBayes to train Bayesian networks (structure and parameters) on vertically partitioned data, which can handle missing values as well as an arbitrary number of parties. For structure learning, we adapted the widely used K2 algorithm with a privacy-preserving scalar product protocol. For parameter learning, we use a two-step approach: first, we learn an intermediate model using maximum likelihood by treating missing values as a special value; then, we train a model on synthetic data generated by the intermediate model using the EM algorithm. The privacy guarantees of our approach are equivalent to the ones provided by the privacy-preserving scalar product protocol used. We experimentally show that our approach produces models comparable to those learnt using traditional algorithms, and we estimate how the computational cost grows with the number of samples, the network size, and the network complexity. Finally, we propose two alternative approaches to estimate the performance of the model using vertically partitioned data, and we show in experiments that they lead to reasonably accurate estimates.
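The two-step parameter learning described in the abstract can be illustrated with a small centralized sketch. This is an illustration under stated assumptions, not the VertiBayes code: the network is a hypothetical two-node graph A -> B, only B has missing entries, all counting is done in plaintext (in the federated setting the counts would come from the scalar product protocol), and every function name is made up for this example.

```python
import random
from collections import Counter

MISSING = "?"  # special value standing in for a missing entry

def ml_cpd(rows, b_states):
    """Maximum-likelihood P(B | A), counting MISSING as an ordinary state of B."""
    counts = Counter(rows)
    a_totals = Counter(a for a, _ in rows)
    return {(a, b): counts[(a, b)] / a_totals[a] for a in a_totals for b in b_states}

def sample(p_a, cpd, b_states, n, rng):
    """Draw n synthetic (a, b) rows from the intermediate model."""
    a_vals = list(p_a)
    rows = []
    for _ in range(n):
        a = rng.choices(a_vals, weights=[p_a[v] for v in a_vals])[0]
        b = rng.choices(b_states, weights=[cpd[(a, s)] for s in b_states])[0]
        rows.append((a, b))
    return rows

def em_cpd(rows, b_states, iterations=50):
    """EM estimate of P(B | A), treating b == MISSING as unobserved."""
    a_vals = sorted({a for a, _ in rows})
    cpd = {(a, s): 1 / len(b_states) for a in a_vals for s in b_states}
    for _ in range(iterations):
        expected = Counter()
        for a, b in rows:
            if b == MISSING:
                # E-step: B is a leaf, so its posterior given A is the current CPD row.
                for s in b_states:
                    expected[(a, s)] += cpd[(a, s)]
            else:
                expected[(a, b)] += 1.0
        # M-step: renormalize the expected counts for each parent value.
        for a in a_vals:
            total = sum(expected[(a, s)] for s in b_states)
            for s in b_states:
                cpd[(a, s)] = expected[(a, s)] / total
    return cpd

# Hypothetical toy data; None marks a missing value of B.
data = [("x", "1"), ("x", "1"), ("x", "0"), ("x", None),
        ("y", "0"), ("y", "0"), ("y", "1"), ("y", None)]
rows = [(a, MISSING if b is None else b) for a, b in data]
p_a = {a: c / len(rows) for a, c in Counter(a for a, _ in rows).items()}

# Step 1: intermediate model with MISSING as a special value.
intermediate = ml_cpd(rows, ["0", "1", MISSING])
# Step 2: synthetic data from the intermediate model, then EM over the real states only.
synthetic = sample(p_a, intermediate, ["0", "1", MISSING], 5000, random.Random(0))
final = em_cpd(synthetic, ["0", "1"])
```

In this toy case B has no children, so EM converges to the proportions among the observed synthetic rows. In a larger network the E-step would require inference over the unobserved variables, which the sketch does not attempt.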

Paper Structure (25 sections, 2 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Training process for VertiBayes
  • Figure 2: Flow diagrams for proposed validation procedures SCV (left) and SVDG (right)