PROVGEN: A Privacy-Preserving Approach for Outcome Validation in Genomic Research
Yuzhou Jiang, Tianxi Ji, Erman Ayday
TL;DR
PROVGEN tackles the challenge of privacy-preserving genomic data sharing for GWAS outcome validation by encoding SNPs into binary form, perturbing them with a correlation-aware XOR-based DP mechanism, and applying a MAF-alignment post-processing step via optimal transport to restore GWAS utility. The two-stage approach maintains differential privacy while enabling reproducibility and error detection in GWAS results, outperforming local DP and synthesis-based baselines on multiple metrics and datasets. The framework demonstrates robust GWAS outcome validation, data utility, and resistance to membership inference attacks, with favorable time complexity for large-scale genomic data. This work offers a practical path toward transparent, verifiable genomic research without compromising individual privacy.
Abstract
Although genomic research has grown increasingly popular in recent years, dataset sharing has remained limited due to privacy concerns. This limitation hinders the reproducibility and validation of research outcomes, both of which are essential for identifying computational errors during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into binary space and applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism that incorporates biological characteristics. Second, we restore data utility by adjusting the Minor Allele Frequency (MAF) values in the noisy dataset to align with published MAFs using optimal transport. Finally, we convert the processed binary data back into its genomic representation and publish the resulting dataset. We evaluate PROVGEN on three real-world genomic datasets and compare it with local differential privacy and three synthesis-based methods. We show that our proposed scheme outperforms all existing methods in detecting GWAS outcome errors, achieves better data utility, and provides stronger privacy protection against membership inference attacks (MIAs). By adopting our method, genomic researchers will be inclined to share differentially private datasets while maintaining high data quality for reproducibility of their findings.
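As an illustration only, the encode, perturb, align, and decode pipeline described above can be sketched in Python. This is a simplified stand-in under stated assumptions, not the PROVGEN implementation: the two-bit encoding, the independent bit flips (which ignore SNP correlations and are not calibrated to any privacy budget), and the count-matching step used in place of optimal transport are all assumptions, and every function name and the `flip_prob` parameter are hypothetical.

```python
import numpy as np

def encode(genotypes):
    # Assumed encoding: each SNP value in {0, 1, 2} becomes two bits
    # whose sum equals the value (0 -> 00, 1 -> 10, 2 -> 11).
    g = np.asarray(genotypes)
    return np.stack([g >= 1, g >= 2], axis=-1).astype(np.uint8)

def decode(bits):
    # Inverse mapping: the SNP value is the number of set bits.
    return bits.sum(axis=-1)

def xor_perturb(bits, flip_prob, rng):
    # Simplified stand-in for the XOR mechanism: XOR with independent
    # Bernoulli noise. The correlation-aware noise is NOT modeled here.
    noise = (rng.random(bits.shape) < flip_prob).astype(np.uint8)
    return bits ^ noise

def align_maf(bits, target_maf, rng):
    # Simplified stand-in for the optimal transport step: for each SNP,
    # flip the fewest bits needed so that its MAF matches the target.
    out = bits.copy()
    n, m, _ = out.shape
    for j in range(m):
        col = out[:, j, :].reshape(-1).copy()
        target_ones = int(round(target_maf[j] * col.size))
        diff = target_ones - int(col.sum())
        if diff > 0:    # too few minor alleles: set some 0-bits to 1
            idx = rng.choice(np.flatnonzero(col == 0), size=diff, replace=False)
            col[idx] = 1
        elif diff < 0:  # too many minor alleles: clear some 1-bits
            idx = rng.choice(np.flatnonzero(col == 1), size=-diff, replace=False)
            col[idx] = 0
        out[:, j, :] = col.reshape(n, 2)
    return out

# Toy usage on synthetic genotypes (200 individuals, 5 SNPs).
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, 0.2, size=(200, 5))
published_maf = genotypes.sum(axis=0) / (2 * genotypes.shape[0])

noisy_bits = xor_perturb(encode(genotypes), flip_prob=0.2, rng=rng)
shared = decode(align_maf(noisy_bits, published_maf, rng))
shared_maf = shared.sum(axis=0) / (2 * shared.shape[0])
```

In this toy setting the shared dataset reproduces the published per-SNP MAFs exactly while individual genotypes remain perturbed; the sketch makes no claim about the privacy guarantee or GWAS utility of the actual method.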
