Privacy-Preserving Dataset Combination

Keren Fuentes, Mimee Xu, Irene Chen

TL;DR

SecureKL introduces a zero-leakage, KL-divergence-based protocol for privately evaluating potential dataset partnerships before any data is shared. Using secure multiparty computation, it computes dataset compatibility without exposing inputs and ranks partner candidates to maximize downstream AUC gains. Across ICU mortality prediction and Folktables income prediction, SecureKL achieves over 90% correlation with non-private baselines and outperforms privacy-leaking strategies, enabling practical data collaborations in regulated domains. This privacy-preserving data appraisal stage promises to increase data utilization, reduce reliance on data marketplaces, and promote equitable data access while maintaining stringent privacy guarantees.

Abstract

Access to diverse, high-quality datasets is crucial for machine learning model performance, yet data sharing remains limited by privacy concerns and competitive interests, particularly in regulated domains like healthcare. This dynamic especially disadvantages smaller organizations that lack the resources to purchase data or negotiate favorable sharing agreements, because they cannot privately assess the utility of external data. To resolve privacy and uncertainty tensions simultaneously, we introduce SecureKL, the first secure protocol for dataset-to-dataset evaluations with zero privacy leakage, designed to be applied before data sharing. SecureKL evaluates a source dataset against candidates, computing dataset divergence metrics internally with private computations, all without assuming downstream models. On real-world data, SecureKL achieves high consistency ($>90\%$ correlation with non-private counterparts) and successfully identifies beneficial data collaborations in highly heterogeneous domains (ICU mortality prediction across hospitals and income prediction across states). Our results highlight that secure computation maximizes data utilization, outperforming privacy-agnostic utility assessments that leak information.

Paper Structure

This paper contains 62 sections, 13 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Privacy can disincentivize data collaborations. Without seeing external data, an organization has two strategies: i. blind default $\pi_0$: randomly selecting partners, which causes hesitation and hinders partnerships. ii. private evaluation $\pi_p$: securely assessing datasets before commitment.
  • Figure 2: The Dataset Combination Problem. Real-world data collaborations are inherently uncertain. AUC change for a source entity after incorporating external data, across hospitals and states. Each entry represents training on a pairwise data combination: each column fixes a test entity (x-axis), which is paired with an external dataset on the y-axis. Left: In eICU [Pollard2018TheEC], 10 out of 12 hospitals may see their mortality prediction model degrade for some potential hospital partners. Right: In Folktables [Ding2021RetiringAN], combining with a random state leads to worse income prediction in 10 out of 15 states. (Red is bad; exact values are reported in Appendix \ref{app:folktables}.)
  • Figure 3: Non-private evaluation strategies. iii. subsampling $\pi_s$: a subset of the target's data is shared. iv. demographic summaries $\pi_d$: the target entity discloses distributions by protected attributes such as age, gender, or race.
  • Figure 4: Our method SecureKL. Each side encrypts their data. Then, a model is privately trained on their joint data. Afterwards, their divergence is computed. Finally, only the final result of this dataset-to-dataset evaluation is revealed.
  • Figure 5: SecureKL: Overall Correctness. Rank correlation between SecureKL output and ground-truth AUC change, $\delta_i$, from acquiring $1$ additional dataset for a given source hospital $H_o$. We propose selecting data partners ranked by the $\mathrm{Secure}\mathrm{KL}_{\mathcal{X}\mathcal{Y}}$ score from our secure system to reliably increase AUC gains. ($|\mathbf{H}|= 12$ hospitals; colored by source.)
  • ...and 6 more figures
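
The Figure 4 caption describes a flow in which each side hides its data, a computation runs on the hidden values, and only the final result is revealed. The excerpt here does not say which cryptographic backend SecureKL uses, so the following is only a minimal sketch of that "reveal only the output" pattern, assuming additive secret sharing (a standard building block of secure multiparty computation) and a toy computation (a sum of two private counts) in place of the actual divergence. The names `share`, `reconstruct`, `secure_sum`, and the modulus `PRIME` are hypothetical and chosen for illustration.

```python
import secrets

# Modulus for share arithmetic (an illustrative choice, not taken from the paper).
PRIME = 2**61 - 1


def share(value, n_parties=2):
    """Split an integer into n additive shares that sum to `value` mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Open a secret by adding all of its shares mod PRIME."""
    return sum(shares) % PRIME


def secure_sum(private_values):
    """Toy protocol: every party secret-shares its input, each party adds the
    shares it holds locally, and only the final total is opened."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Party j holds the j-th share of every input and sums them locally.
    local_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    return reconstruct(local_sums)


# Toy values (not from the paper): each party's private count.
source_count = 137
target_count = 58
print(secure_sum([source_count, target_count]))  # 195
```

Each individual share is uniformly random on its own, so neither party learns the other's input from the shares it receives; only the opened total is disclosed. A real divergence computation also needs secure multiplication and nonlinear functions such as logarithms, which this sketch does not cover.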