Towards Biologically Plausible and Private Gene Expression Data Generation

Dingfan Chen, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, Mario Fritz

TL;DR

This work evaluates five differentially private generative models (RON-Gauss, VAE, GAN, Private-PGM, PrivSyn) for real-world gene expression data across utility, statistical fidelity, and biological plausibility. While several methods achieve strong downstream utility under DP, none fully preserve biologically meaningful characteristics such as differential expression and gene co-expression, revealing a disconnect between standard utility metrics and biological realism. The study highlights the necessity of multi-dimensional evaluation to avoid over-optimistic assessments and suggests that VAEs and certain graphical-model-based approaches may fare better in balancing privacy with some biological structure, though further improvements are needed. The authors contribute a systematic benchmarking framework and publicly available setup to drive progress toward biologically plausible and privacy-preserving synthetic gene expression data.

Abstract

Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications. Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions. In this paper, we initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data. We conduct a comprehensive analysis of five representative DP generation methods, examining them from various angles, such as downstream utility, statistical properties, and biological plausibility. Our extensive evaluation illuminates the unique characteristics of each DP generation method, offering critical insights into the strengths and weaknesses of each approach, and uncovering intriguing possibilities for future developments. Perhaps surprisingly, our analysis reveals that most methods are capable of achieving seemingly reasonable downstream utility, according to the standard evaluation metrics considered in existing literature. Nevertheless, we find that none of the DP methods are able to accurately capture the biological characteristics of the real dataset. This observation suggests a potential over-optimistic assessment of current methodologies in this field and underscores a pressing need for future enhancements in model design.

Paper Structure

This paper contains 44 sections, 1 theorem, 11 equations, 18 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

If $\mathcal{M}$ satisfies $(\varepsilon,\delta)$-DP, then $F\circ \mathcal{M}$ also satisfies $(\varepsilon,\delta)$-DP for any data-independent function $F$, where $\circ$ denotes function composition.
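As an illustration of the theorem, the following minimal sketch releases a noisy mean with the Gaussian mechanism and then applies a data-independent clamp to the output. It is not the paper's code. The function names, the example values, the clipping range, and the replace-one-record notion of adjacency are all assumptions made for illustration, and the noise scale uses the classical calibration $\sigma = \sqrt{2\ln(1.25/\delta)}\,\Delta/\varepsilon$, which is only valid for $0 < \varepsilon < 1$.

```python
import math
import random


def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def private_mean(values, lower, upper, epsilon, delta, rng):
    """Release the mean of clipped values with Gaussian noise added."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Assumption: adjacent datasets differ by replacing one record, so the
    # mean of n clipped values changes by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return sum(clipped) / len(clipped) + rng.gauss(0.0, sigma)


def post_process(noisy_mean, lower, upper):
    """Data-independent F: clamp the released value to the public range."""
    return min(max(noisy_mean, lower), upper)


rng = random.Random(0)
expression = [2.1, 3.4, 0.7, 5.9, 4.2, 1.8]  # hypothetical expression values
noisy = private_mean(expression, 0.0, 10.0, epsilon=0.5, delta=1e-5, rng=rng)
released = post_process(noisy, 0.0, 10.0)
```

Because `post_process` looks only at the noisy output and at public bounds, never at `expression`, the clamped value carries the same $(\varepsilon,\delta)$ guarantee as `noisy`. The same argument is what allows samples drawn from a DP-trained generator to be reused without additional privacy cost.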

Figures (18)

  • Figure 1: Utility Evaluation by Machine Learning Efficacy, and Statistical Evaluation by Histogram Intersection and Distance to Closest Record. Panel (a) shows the Accuracy Scores for the Machine Learning Efficacy metric across the five models in the DP case (blue shading) for varying $\varepsilon$ values, alongside the non-private case. Panels (b) and (c) show the Overlap Score for the Histogram Intersection metric and the K-Nearest Neighbors Distance Score for the Distance to Closest Record metric, respectively. The evaluation used two seeds for creating the training split and two seeds for generating the synthetic data; the values shown are means across these seeds. The black dashed line marks the reference score on the real train-test data, which is the best attainable score.
  • Figure 2: Biological Evaluation by DE-Gene Preservation. Shown is the preservation of DE-genes (true positive rate (TPR): solid lines; false positive rate (FPR): dashed lines) across the tested models for the DP-case (indicated by blue shading) with different values of $\varepsilon$ and the non-private case. The evaluation was performed for two different seeds used for creating the training split (left and right plot). The presented values are means across two different seeds set for generating the data (except for Private-PGM and PrivSyn, where seeding is not possible).
  • Figure 3: Biological Evaluation by Co-Expression Preservation for $r$ > 0. Shown is the co-expression preservation across the tested models for different values of $\varepsilon$ as well as the non-private case for two different seeds used for creating the training split (left and right plot). Specifically, non-transparent bars give the number of correctly reconstructed co-expressions with Pearson Correlation Coefficient $r$ > 0 and an associated p-value < 0.05, while semi-transparent bars give the number of co-expressions introduced by the model that did not exist in the real data. The dashed black line indicates the number of co-expressions in the real data. All values shown are means across two different seeds set for generating the data (except for Private-PGM and PrivSyn, where seeding is not possible).
  • Figure 4: Activation patterns of co-expressed gene modules in VAE for $r$ > 0. Shown are the Group Fold Changes (GFCs) of gene modules (rows) in the real data and in the synthetic data sampled with two different seeds. The dendrograms represent the hierarchical clustering of the sample groups, which are differentiated by label class and seed; each column corresponds to a distinct group. Ideally, groups with the same label class are adjacent, indicating that they cluster together. Numbers on the right give the number of genes per module, and numbers in square brackets at the bottom give the number of samples per condition and dataset. Darker shades of red indicate activation of the gene module, while darker shades of blue indicate deactivation.
  • Figure 5: Utility Evaluation by Machine Learning Efficacy.
  • ...and 13 more figures
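
The two statistical metrics named in the Figure 1 caption can be illustrated with a short sketch. This is a generic, assumed formulation and not the paper's evaluation code: the bin count, the use of Euclidean distance, and the function names are illustrative choices, and the paper's exact scoring (for example how per-gene scores are aggregated or how the distance is turned into a score) may differ.

```python
import math


def histogram_intersection(real, synth, bins=10):
    """Overlap of two normalized histograms of one feature, in [0, 1]."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def normalized_hist(values):
        counts = [0] * bins
        for v in values:
            # The maximum value is placed in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    h_real, h_synth = normalized_hist(real), normalized_hist(synth)
    return sum(min(a, b) for a, b in zip(h_real, h_synth))


def distance_to_closest_record(synth_rows, real_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synth_rows]


# Hypothetical toy data: one gene for the histogram, two genes for the distances.
overlap = histogram_intersection([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
distances = distance_to_closest_record([(0.0, 0.0)], [(3.0, 4.0), (6.0, 8.0)])
```

An overlap close to 1 means the synthetic marginal distribution of a gene closely matches the real one. Very small closest-record distances suggest that synthetic samples sit near individual real records, which is why this kind of metric is often read as a rough proximity check alongside fidelity.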

Theorems & Definitions (3)

  • Definition 3.1: $(\varepsilon,\delta)$-DP (Dwork and Roth, 2014)
  • Definition 3.2: Gaussian Mechanism (Dwork and Roth, 2014)
  • Theorem 3.1: Post-processing Theorem (Dwork and Roth, 2014)
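
For reference, the two definitions listed above are usually stated as follows in the textbook by Dwork and Roth. This is the standard formulation and not a quotation from the paper, whose wording and notation may differ; the symbols $D$, $D'$, $\mathcal{S}$, $f$, and $\Delta_2 f$ are introduced here for illustration. A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon,\delta)$-DP if, for all adjacent datasets $D$ and $D'$ and all sets of outputs $\mathcal{S}$,

$$\Pr[\mathcal{M}(D)\in \mathcal{S}] \le e^{\varepsilon}\,\Pr[\mathcal{M}(D')\in \mathcal{S}] + \delta .$$

The Gaussian mechanism releases a function $f$ with $\ell_2$-sensitivity $\Delta_2 f = \max_{D,D'} \lVert f(D)-f(D')\rVert_2$ by adding isotropic Gaussian noise:

$$\mathcal{M}(D) = f(D) + \mathcal{N}(0,\sigma^2 I), \qquad \sigma \ge \frac{\sqrt{2\ln(1.25/\delta)}\;\Delta_2 f}{\varepsilon} \quad \text{for } \varepsilon\in(0,1),$$

which is sufficient for $\mathcal{M}$ to satisfy $(\varepsilon,\delta)$-DP under the classical analysis.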