
Privacy Preserving Data Imputation via Multi-party Computation for Medical Applications

Julia Jentsch, Ali Burak Ünal, Şeyma Selcan Mağara, Mete Akgün

TL;DR

This paper tackles missing data in medical datasets under strict privacy constraints by proposing privacy-preserving imputation methods realized through secure multi-party computation (MPC). It implements mean, median, regression, and kNN imputations within MPC and validates them on a diabetes dataset, reporting a maximum error of about $3 \times 10^{-3}$ relative to plaintext methods. The results show near-parity with plaintext imputation while exhibiting linear scalability with the number of samples, though kNN remains computationally intensive. The work enables collaborative, privacy-preserving data preprocessing in healthcare, laying a foundation for secure, multi-institution data analyses that rely on imputed data.

Abstract

Handling missing data is crucial in machine learning, since many datasets contain gaps due to errors or non-response. Traditional approaches such as listwise deletion are simple but inadequate; the literature offers more sophisticated imputation methods that preserve sample size and improve accuracy. However, these methods require access to the whole dataset, which conflicts with privacy regulations when the data is distributed among multiple sources. In the medical and healthcare domain in particular, such access reveals sensitive information about patients. This study addresses privacy-preserving imputation for sensitive data using secure multi-party computation, which enables computation without revealing any party's sensitive information. We realize mean, median, regression, and kNN imputation in a privacy-preserving way. Given the importance of protecting patient data, we specifically target the medical and healthcare domains and showcase our methods on a diabetes dataset. Experiments on this dataset validate the correctness of our privacy-preserving imputation methods: the largest error is around $3 \times 10^{-3}$, closely matching plaintext methods. We also analyze how our methods scale with the number of samples to assess their applicability to real-world healthcare problems. All of our methods scale linearly with the number of samples, and, except for kNN, their runtimes indicate that they can be used on large datasets.
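To make the idea of imputing over secret-shared data concrete, the following is a minimal sketch of mean imputation with additive secret sharing and fixed-point encoding. It is not the paper's protocol: the abstract does not name the MPC framework, and the modulus `PRIME`, the scaling factor `SCALE`, and all function names are assumptions chosen for illustration. The sketch also opens the aggregate sum and count, which a full MPC protocol may keep hidden by performing the division on shares.

```python
import random

PRIME = 2**61 - 1   # assumed field modulus for the toy scheme
SCALE = 2**16       # assumed fixed-point scaling factor


def encode(x):
    """Map a real number to a field element using fixed-point encoding."""
    return round(x * SCALE) % PRIME


def decode(v):
    """Map a field element back to a real number (upper half = negatives)."""
    if v > PRIME // 2:
        v -= PRIME
    return v / SCALE


def share(value, n_parties, rng):
    """Split a field element into n additive shares that sum to it mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine additive shares into the underlying field element."""
    return sum(shares) % PRIME


def secure_mean(party_values, rng):
    """Mean of all observed values, where party_values[i] holds the
    non-missing values of one feature owned by party i."""
    n = len(party_values)
    # Each party secret-shares its local sum and local count.
    sum_shares = [share(encode(sum(vals)), n, rng) for vals in party_values]
    cnt_shares = [share(len(vals), n, rng) for vals in party_values]
    # Party j adds up the j-th share it received from every party.
    agg_sum = [sum(s[j] for s in sum_shares) % PRIME for j in range(n)]
    agg_cnt = [sum(c[j] for c in cnt_shares) % PRIME for j in range(n)]
    # Only the aggregates are opened (a simplification of this sketch).
    total = decode(reconstruct(agg_sum))
    count = reconstruct(agg_cnt)
    return total / count


if __name__ == "__main__":
    # random.Random is used for reproducibility only; it is not
    # cryptographically secure and would be replaced in practice.
    rng = random.Random(0)
    hospitals = [[1.0, 2.0], [3.0], [4.0, 6.0]]
    print(secure_mean(hospitals, rng))  # value used to fill missing entries
```

Each party would then fill its own missing entries with the returned value. The fixed-point rounding in `encode` is one source of small deviations from a plaintext computation, which is the kind of discrepancy the reported error bound quantifies.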

Paper Structure

This paper contains 28 sections, 4 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Performance analysis of our privacy-preserving data imputation methods for varying sample sizes