SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio

Abstract

Polygenic risk scores and other genomic analyses require large individual-level genotype datasets, yet strict data access restrictions impede sharing. Synthetic genotype generation offers a privacy-preserving alternative, but most existing methods operate unconditionally, producing samples without phenotype alignment, or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility. We present SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes. SNPgen combines GWAS-guided variant selection (1,024-2,048 trait-associated SNPs) with a variational autoencoder for genotype compression and a latent diffusion model conditioned on binary disease labels via classifier-free guidance. Evaluated on 458,724 UK Biobank individuals across four complex diseases (coronary artery disease, breast cancer, type 1 and type 2 diabetes), models trained on synthetic data matched real-data predictive performance in a train-on-synthetic, test-on-real protocol, approaching genome-wide PRS methods that use $2$-$6\times$ more variants. Privacy analysis confirmed zero identical matches, near-random membership inference (AUC $\approx 0.50$), preserved linkage disequilibrium structure, and high allele frequency correlation ($r \geq 0.95$) with source data. A controlled simulation with known causal effects verified faithful recovery of the imposed genetic association structure.
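
Two of the fidelity and privacy checks named in the abstract, allele frequency correlation and the count of identical matches, can be illustrated with a minimal sketch. It assumes genotypes are stored as a (samples, SNPs) matrix of 0/1/2 alternate-allele dosages; the function names are hypothetical and are not taken from the SNPgen code, and the paper's exact evaluation procedure may differ.

```python
import numpy as np


def allele_frequencies(genotypes: np.ndarray) -> np.ndarray:
    """Per-SNP alternate-allele frequency from a (samples, snps) matrix of 0/1/2 dosages."""
    # Each individual carries two alleles, so the mean dosage is divided by 2.
    return np.asarray(genotypes, dtype=float).mean(axis=0) / 2.0


def allele_frequency_correlation(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Pearson r between per-SNP allele frequencies of the real and synthetic cohorts."""
    return float(np.corrcoef(allele_frequencies(real), allele_frequencies(synthetic))[0, 1])


def count_identical_matches(real: np.ndarray, synthetic: np.ndarray) -> int:
    """Number of synthetic individuals whose full genotype vector also appears in the real data."""
    # Cast to a common dtype so that equal rows have equal byte representations.
    real_rows = {row.tobytes() for row in np.asarray(real, dtype=np.int8)}
    return sum(row.tobytes() in real_rows for row in np.asarray(synthetic, dtype=np.int8))
```

Under these assumptions, a correlation close to 1 indicates that marginal allele frequencies are preserved, while a match count of 0 means no synthetic individual is an exact copy of a training individual. Neither check by itself establishes privacy; the abstract additionally reports a membership inference analysis.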

Paper Structure

This paper contains 35 sections, 2 equations, 11 figures, and 12 tables.

Figures (11)

  • Figure 1: SNPgen pipeline. UK Biobank genotypes and external GWAS summary statistics are combined for GWAS-guided SNP selection, producing a compact set of trait-associated variants. In Stage 1, a 1D VAE encoder compresses the selected SNPs into a continuous latent space $\mathbf{z}$. In Stage 2, a latent diffusion model (1D UNet DDPM) generates novel latent representations conditioned on binary disease labels via cross-attention, which are then decoded into synthetic trait-conditioned genotypes.
  • Figure 2: Downstream risk prediction (ROC-AUC) across four UK Biobank traits. Bars show three models (XGBoost, XGBoost Balanced, PRS Univariate) under four data conditions: Real, Reconstructed, Synthetic, and Synthetic Augmented. Error bars: 95% CI from 5-fold CV. Dashed lines: External PRS baseline. Cohort sizes: CAD $n{=}458{,}724$ (35,306 cases); BC $n{=}248{,}987$ (19,634 cases, female only); T1D $n{=}458{,}724$ (4,593 cases); T2D $n{=}458{,}724$ (38,541 cases).
  • Figure 3: Pairwise LD ($r^2$) for T2D (first 200 SNPs). Left: original. Centre: reconstructed. Right: synthetic. Block-diagonal structure is preserved.
  • Figure S1: Pairwise LD ($r^2$) for CAD (2,048 SNPs). Left: original. Centre: VAE-reconstructed. Right: LDM-generated synthetic.
  • Figure S2: Pairwise LD ($r^2$) for BC (1,024 SNPs). Left: original. Centre: VAE-reconstructed. Right: LDM-generated synthetic.
  • ...and 6 more figures
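
The pairwise LD ($r^2$) heatmaps described in the captions above can be approximated with a short sketch. It assumes a (samples, SNPs) matrix of 0/1/2 dosages and uses the squared Pearson correlation between dosages, a common genotype-level stand-in for haplotype-based $r^2$; the paper may compute LD differently, and both function names are hypothetical.

```python
import numpy as np


def pairwise_ld_r2(genotypes: np.ndarray) -> np.ndarray:
    """Squared Pearson correlation between every pair of SNP columns."""
    g = np.asarray(genotypes, dtype=float)
    # Monomorphic SNPs have zero variance and yield NaN correlations;
    # the warnings are silenced and those entries are set to 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.corrcoef(g, rowvar=False)
    return np.nan_to_num(r, nan=0.0) ** 2


def ld_matrix_similarity(ld_a: np.ndarray, ld_b: np.ndarray) -> float:
    """Pearson correlation between the off-diagonal upper triangles of two LD matrices."""
    upper = np.triu_indices_from(ld_a, k=1)
    return float(np.corrcoef(ld_a[upper], ld_b[upper])[0, 1])
```

Computing `pairwise_ld_r2` separately on the original, reconstructed, and synthetic matrices gives the three panels of each figure. `ld_matrix_similarity` is one possible way to summarise how well the block-diagonal structure is preserved as a single number; it is an illustration and not a metric reported in the source.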