PROVGEN: A Privacy-Preserving Approach for Outcome Validation in Genomic Research

Yuzhou Jiang, Tianxi Ji, Erman Ayday

TL;DR

PROVGEN addresses privacy-preserving genomic data sharing for GWAS outcome validation by encoding SNPs into binary form, perturbing them with a correlation-aware XOR-based DP mechanism, and applying a MAF-alignment post-processing step via optimal transport to restore GWAS utility. The two-stage approach maintains differential privacy while enabling reproducibility and error detection in GWAS results, outperforming local DP and synthesis-based baselines on multiple metrics and datasets. The framework demonstrates robust GWAS outcome validation, data utility, and resistance to membership inference attacks, with favorable time complexity for large-scale genomic data. This work offers a practical path toward transparent, verifiable genomic research without compromising individual privacy.

Abstract

As genomic research has grown increasingly popular in recent years, dataset sharing has remained limited due to privacy concerns. This limitation hinders the reproducibility and validation of research outcomes, both of which are essential for identifying computational errors during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into binary space and applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism that incorporates biological characteristics. Second, we restore data utility by adjusting the Minor Allele Frequency (MAF) values in the noisy dataset to align with published MAFs using optimal transport. Finally, we convert the processed binary data back into its genomic representation and publish the resulting dataset. We evaluate PROVGEN on three real-world genomic datasets and compare it with local differential privacy and three synthesis-based methods. We show that our proposed scheme outperforms all existing methods in detecting GWAS outcome errors, achieves better data utility, and provides higher privacy protection against membership inference attacks (MIAs). By adopting our method, genomic researchers will be inclined to share differentially private datasets while maintaining high data quality for reproducibility of their findings.
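The encode, perturb, and decode steps described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the 2-bit encoding table, the helper names (`encode_snps`, `decode_snps`, `xor_perturb`, `minor_allele_freq`), and above all the noise model. The sketch XORs the bits with independent Bernoulli noise (plain randomized response), which is a stand-in and not the paper's correlation-aware XOR mechanism. It also omits the optimal-transport stage and only prints MAFs before and after perturbation to show the drift that stage is meant to correct.

```python
import numpy as np

# Hypothetical 2-bit encoding of SNP values 0, 1, 2 (count of minor alleles).
# The paper's actual encoding may differ.
ENCODE = np.array([[0, 0], [0, 1], [1, 1]], dtype=np.uint8)


def encode_snps(snps):
    """Map an (n, m) integer array with values in {0, 1, 2} to (n, 2m) bits."""
    return ENCODE[snps].reshape(snps.shape[0], -1)


def decode_snps(bits):
    """Sum each bit pair back to a SNP value; the unused code 10 decodes to 1."""
    n = bits.shape[0]
    return bits.reshape(n, -1, 2).sum(axis=2).astype(np.uint8)


def xor_perturb(bits, flip_prob, rng):
    """XOR every bit with independent Bernoulli(flip_prob) noise.

    Stand-in noise model (randomized response), not the paper's mechanism.
    """
    noise = (rng.random(bits.shape) < flip_prob).astype(np.uint8)
    return bits ^ noise


def minor_allele_freq(snps):
    """Per-SNP minor allele frequency for diploid genotypes."""
    freq = snps.sum(axis=0) / (2 * snps.shape[0])
    return np.minimum(freq, 1 - freq)


rng = np.random.default_rng(0)
snps = rng.integers(0, 3, size=(1000, 5))   # synthetic genotypes, not real data

eps_bit = 2.0                               # illustrative per-bit budget
flip_prob = 1.0 / (1.0 + np.exp(eps_bit))   # randomized-response flip rate

noisy_snps = decode_snps(xor_perturb(encode_snps(snps), flip_prob, rng))

print("MAF before perturbation:", minor_allele_freq(snps))
print("MAF after perturbation: ", minor_allele_freq(noisy_snps))
```

Under these assumptions the perturbed MAFs generally differ from the originals, which is the utility loss that the second stage addresses by aligning them with the published values.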

Paper Structure

This paper contains 45 sections, 4 theorems, 23 equations, 9 figures, and 7 tables.

Key Result

Proposition 1 (Post-processing)

[dwork2014algorithmic] Let $\mathcal{M}$ be a randomized algorithm that is $\epsilon$-differentially private. For any randomized mapping $f:\mathcal{R}^q \rightarrow \mathcal{R}^r$ where $q,r \in \mathbb{N}^+$, the composition $f\circ \mathcal{M}$ is $\epsilon$-differentially private.
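The standard argument for a deterministic $f$ is short enough to state here. The notation is assumed, since Definition 1 is not reproduced on this page: $D$ and $D'$ denote neighboring datasets, $S$ is any set of outputs of $f$, and $T = \{x : f(x) \in S\}$ is its preimage.

$$
\Pr[f(\mathcal{M}(D)) \in S] = \Pr[\mathcal{M}(D) \in T] \le e^{\epsilon}\,\Pr[\mathcal{M}(D') \in T] = e^{\epsilon}\,\Pr[f(\mathcal{M}(D')) \in S]
$$

A randomized $f$ whose randomness is independent of the data can be viewed as a mixture of deterministic mappings, so the same bound holds after averaging over that randomness. This property is what allows a post-processing stage that reads only the noisy dataset, such as the MAF alignment, to run without spending additional privacy budget, on the assumption that the published MAFs are treated as public information.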

Figures (9)

  • Figure 1: The PROVGEN workflow. 1) The input dataset $D$ is encoded into binary form $D^b$ and XORed with binary noise generated through Efficient Binary Noise Generation (EBNG). 2) The Minor Allele Frequencies (MAFs) of SNPs published in the research findings are used to improve the utility of the noisy dataset $\hat{D}^b$ via optimal transport. Finally, the optimized binary dataset $\hat{D}^b$ is converted back into its original SNP format to obtain the final shared dataset $D'$.
  • Figure 2: GWAS outcome validation performance for the $\chi^2$ test against flipping errors, comparing PROVGEN with LDP [kasiviswanathan2011can].
  • Figure 3: GWAS outcome validation performance for the $\chi^2$ test against noise errors, comparing PROVGEN with LDP [kasiviswanathan2011can].
  • Figure 4: GWAS outcome validation performance for the odds ratio test against flipping errors, comparing PROVGEN with LDP [kasiviswanathan2011can].
  • Figure 5: GWAS outcome validation performance for the odds ratio test against noise errors, comparing PROVGEN with LDP [kasiviswanathan2011can].
  • ...and 4 more figures

Theorems & Definitions (7)

  • Definition 1: Differential Privacy
  • Proposition 1: Post-processing
  • Definition 2: XOR Mechanism
  • Theorem 3.1
  • Lemma 6.1
  • Definition 3: Efficient Binary Noise Generation
  • Theorem 6.2