Biologically-Informed Hybrid Membership Inference Attacks on Generative Genomic Models

Asia Belfiore, Jonathan Passerat-Palmbach, Dmitrii Usynin

TL;DR

Genomic data privacy is at risk when using generative models to produce synthetic mutation profiles. The authors introduce Biologically-Informed Hybrid Membership Inference Attacks (biHMIA) that fuse model-based metrics with genomics-specific features to assess privacy, and they evaluate differential privacy-enabled LMs (GPT-2 and MinGPT) trained on chromosome 22 data from the 1000 Genomes Project. Results show that smaller transformer models offer stronger innate privacy, while DP training provides additional protection and can regularize generation, sometimes improving utility; biHMIA generally achieves higher attack success than traditional MIAs. The work highlights the need for genomics-focused privacy evaluation and demonstrates that careful design of synthetic-genomics pipelines can balance data utility with privacy protection, informing risk assessment and policy in genomic data sharing.

Abstract

The increased availability of genetic data has transformed genomics research, but has also raised many privacy concerns about how such sensitive data is handled. This work explores the use of language models (LMs) for the generation of synthetic genetic mutation profiles, leveraging differential privacy (DP) for the protection of sensitive genetic data. We empirically evaluate the privacy guarantees of our DP models by introducing a novel Biologically-Informed Hybrid Membership Inference Attack (biHMIA), which combines traditional black-box MIA with contextual genomics metrics for enhanced attack power. Our experiments show that both small and large transformer GPT-like models are viable synthetic variant generators for small-scale genomics, and that our hybrid attack leads, on average, to higher adversarial success compared to traditional metric-based MIAs.
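To make the idea of a hybrid attack concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "hybrid" means concatenating model-based metrics with genomics-specific metrics into one feature vector per sample and passing that vector to an attack classifier (here a small K-Nearest Neighbour vote, one of the classifier families named in the figure captions). The feature choices (a per-sample loss and a rare-variant fraction), the helper names `hybrid_features` and `knn_predict`, and all numbers are hypothetical.

```python
from math import dist


def hybrid_features(model_metrics, genomic_metrics):
    """Concatenate model-based and genomics-specific metrics for one sample.

    Assumption: simple concatenation; the paper may fuse features differently.
    """
    return list(model_metrics) + list(genomic_metrics)


def knn_predict(train_x, train_y, query, k=3):
    """Predict membership (1 = member, 0 = non-member) by majority vote
    among the k training points closest to `query` (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


# Toy, made-up attack training set.
# Model metric: per-sample loss (hypothetical).
# Genomic metric: fraction of rare variants in the profile (hypothetical).
train_x = [
    hybrid_features([0.20], [0.10]),
    hybrid_features([0.25], [0.12]),
    hybrid_features([0.30], [0.15]),
    hybrid_features([0.90], [0.40]),
    hybrid_features([0.85], [0.45]),
    hybrid_features([0.95], [0.50]),
]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = training-set member, 0 = non-member

candidate = hybrid_features([0.22], [0.11])
print("predicted member:", knn_predict(train_x, train_y, candidate))
```

A purely model-based attack would correspond to calling the same classifier on the model metrics alone, so the two settings can be compared on identical data.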

Paper Structure

This paper contains 25 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of Model-Based versus Hybrid MIA on MinGPT trained without (a) and with DP ($\epsilon=1$) (b). It shows from left to right: AUC, Accuracy, Precision, Recall, F1-Score and Attack Advantage for Threshold Attack, Logistic Regression, Random Forest and K-Nearest Neighbour.
  • Figure 2: Comparison of Model-Based versus Hybrid MIA on GPT-2. It shows from left to right: AUC, Accuracy, Precision, Recall, F1-Score and Attack Advantage for Threshold Attack (in blue), Logistic Regression (in yellow), Random Forest (in purple) and K-Nearest Neighbour (in green).
  • Figure 3: Effects of DP ($\epsilon=1$) on Model-Based (a) versus Hybrid MIA (b) on MinGPT. It shows from left to right: AUC, Accuracy, Precision, Recall, F1-Score and Attack Advantage for Threshold Attack (in blue), Logistic Regression (in yellow), Random Forest (in purple) and K-Nearest Neighbour (in green).
  • Figure 4: Comparison of Mutation Statistics metric across synthetic generated cohorts of 50 samples and the original data.
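The attack metrics named in the figure captions can be computed from an attacker's membership guesses as in the sketch below. It is illustrative only: it assumes a loss-based threshold attack (a sample is guessed to be a member when its loss falls below a threshold) and the common definition of attack advantage as true positive rate minus false positive rate, which may differ from the definition used in the paper. The function names and the toy losses are hypothetical.

```python
def threshold_attack(losses, threshold):
    """Guess 'member' (1) when the per-sample loss is below the threshold."""
    return [1 if loss < threshold else 0 for loss in losses]


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def attack_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and advantage (assumed TPR - FPR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # true positive rate
    fpr = _ratio(fp, fp + tn)     # false positive rate
    return {
        "accuracy": _ratio(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "advantage": recall - fpr,
    }


def auc_from_losses(y_true, losses):
    """AUC as the fraction of (member, non-member) pairs in which the member
    has the lower loss; ties count as half a pair."""
    members = [l for l, t in zip(losses, y_true) if t == 1]
    non_members = [l for l, t in zip(losses, y_true) if t == 0]
    wins = sum(
        1.0 if m < n else 0.5 if m == n else 0.0
        for m in members
        for n in non_members
    )
    return _ratio(wins, len(members) * len(non_members))


# Toy, made-up losses: the first three samples are members.
y_true = [1, 1, 1, 0, 0, 0]
losses = [0.1, 0.2, 0.8, 0.3, 0.9, 0.7]

y_pred = threshold_attack(losses, threshold=0.5)
metrics = attack_metrics(y_true, y_pred)
metrics["auc"] = auc_from_losses(y_true, losses)
print(metrics)
```

Under this definition, an advantage near 0 means the attacker does no better than random guessing, while an advantage near 1 indicates near-perfect membership inference.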