Towards Biologically Plausible and Private Gene Expression Data Generation
Dingfan Chen, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, Mario Fritz
TL;DR
This work evaluates five differentially private generative models (RON-Gauss, VAE, GAN, Private-PGM, PrivSyn) on real-world gene expression data across utility, statistical fidelity, and biological plausibility. While several methods achieve strong downstream utility under DP, none fully preserves biologically meaningful characteristics such as differential expression and gene co-expression, revealing a gap between standard utility metrics and biological realism. The study argues that evaluation along multiple dimensions is needed to avoid over-optimistic assessments, and suggests that VAEs and certain graphical-model-based approaches may better balance privacy against the preservation of some biological structure, though further improvements are needed. The authors contribute a systematic benchmarking framework and a publicly available setup to drive progress toward biologically plausible and privacy-preserving synthetic gene expression data.
Abstract
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications. Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions. In this paper, we initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data. We conduct a comprehensive analysis of five representative DP generation methods, examining them from various angles, such as downstream utility, statistical properties, and biological plausibility. Our extensive evaluation illuminates the unique characteristics of each DP generation method, offering critical insights into the strengths and weaknesses of each approach, and uncovering intriguing possibilities for future developments. Perhaps surprisingly, our analysis reveals that most methods are capable of achieving seemingly reasonable downstream utility, according to the standard evaluation metrics considered in existing literature. Nevertheless, we find that none of the DP methods is able to accurately capture the biological characteristics of the real dataset. This observation suggests that current methodologies in this field may have been assessed over-optimistically and underscores a pressing need for future enhancements in model design.
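
To make the notion of biological plausibility more concrete, the following is a minimal sketch of one way gene co-expression in synthetic data could be compared against real data. It is an illustration only, not the evaluation protocol used in the paper: the function name `coexpression_agreement`, the choice of Pearson correlation, the two summary scores, and the toy data are all assumptions made for this example.

```python
import numpy as np


def coexpression_agreement(real, synth):
    """Compare gene-gene Pearson correlations of two (samples x genes) arrays.

    Returns (mean absolute difference, Pearson correlation), both computed
    over the unordered gene pairs. Illustrative metric, not the paper's.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    if real.shape[1] != synth.shape[1]:
        raise ValueError("real and synth must have the same number of genes")

    # rowvar=False treats columns (genes) as the variables.
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synth, rowvar=False)

    # Keep each unordered gene pair once, excluding the diagonal.
    iu = np.triu_indices(real.shape[1], k=1)
    pairs_real = corr_real[iu]
    pairs_synth = corr_synth[iu]

    mad = float(np.mean(np.abs(pairs_real - pairs_synth)))
    agreement = float(np.corrcoef(pairs_real, pairs_synth)[0, 1])
    return mad, agreement


rng = np.random.default_rng(0)
# Toy "real" data (assumption): 5 genes driven by one shared latent factor,
# so they are strongly co-expressed.
latent = rng.normal(size=(200, 1))
real = latent + 0.5 * rng.normal(size=(200, 5))
# Toy "synthetic" data (assumption): each gene is shuffled independently,
# which keeps per-gene marginals exactly but destroys co-expression.
synth = np.column_stack([rng.permutation(real[:, j]) for j in range(5)])

mad_self, agree_self = coexpression_agreement(real, real)
mad_synth, agree_synth = coexpression_agreement(real, synth)
print(f"real vs real:  mean abs diff = {mad_self:.3f}")
print(f"real vs synth: mean abs diff = {mad_synth:.3f}")
```

In this toy setup the shuffled data reproduces every per-gene distribution exactly, yet its co-expression structure is lost. This mirrors the kind of gap between marginal-level metrics and biological structure that the abstract describes.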
