How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

Sofiane Ouaari, Jules Kreuer, Nico Pfeifer

TL;DR

This study evaluates the resilience of DNA foundation models to model inversion attacks, in which adversaries attempt to reconstruct sensitive training data from model outputs, and finds that the correlation between embedding similarity and sequence similarity is a key predictor of reconstruction success.

Abstract

DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings: dense vector representations that capture complex genomic information. These embeddings are increasingly shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, in which adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model output used for reconstruction is a zero-shot embedding, which is fed to a decoder that reconstructs the DNA sequence. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be the most vulnerable, especially for shorter sequences, where reconstruction similarities exceed 90%, while DNABERT-2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings. Training code, model weights, and the evaluation pipeline are released at https://github.com/not-a-feature/DNA-Embedding-Inversion.
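
The contrast between per-token and mean-pooled embeddings, and the idea of correlating embedding similarity with sequence similarity, can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: a one-hot vector per nucleotide stands in for a foundation model, an argmax stands in for the trained decoder, and position-wise identity stands in for the sequence similarity measure. It is not the released pipeline, and the function names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

def per_token_embedding(seq):
    """Toy stand-in for a model: one one-hot vector per nucleotide."""
    return [[1.0 if base == ref else 0.0 for ref in ALPHABET] for base in seq]

def mean_pool(token_vectors):
    """Average the per-token vectors into one fixed-size vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def decode_per_token(token_vectors):
    """Toy 'decoder' (an assumption): pick the largest entry at each position."""
    return "".join(
        ALPHABET[max(range(len(ALPHABET)), key=vec.__getitem__)]
        for vec in token_vectors
    )

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def identity(a, b):
    """Fraction of matching positions for two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pearson(xs, ys):
    """Pearson correlation coefficient (0.0 if either input is constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def similarity_correlation(length=30, pairs=200, seed=0):
    """Correlate mean-pooled embedding similarity with sequence identity
    over random sequence pairs (toy embedding, illustrative only)."""
    rng = random.Random(seed)
    emb_sims, seq_sims = [], []
    for _ in range(pairs):
        a = "".join(rng.choice(ALPHABET) for _ in range(length))
        b = "".join(rng.choice(ALPHABET) for _ in range(length))
        emb_sims.append(cosine(mean_pool(per_token_embedding(a)),
                               mean_pool(per_token_embedding(b))))
        seq_sims.append(identity(a, b))
    return pearson(emb_sims, seq_sims)

if __name__ == "__main__":
    # Per-token vectors keep positional information, so the toy decoder is exact.
    print(decode_per_token(per_token_embedding("ACGT")))
    # Mean pooling discards order: two different sequences collide.
    print(mean_pool(per_token_embedding("ACGT")) == mean_pool(per_token_embedding("TGCA")))
    print(similarity_correlation())
```

In this toy setting the per-token representation is trivially invertible, while mean pooling maps "ACGT" and "TGCA" to the same vector. Real foundation-model embeddings are contextual, so they behave differently in detail, but the sketch shows why pooled vectors can be harder to invert and how a similarity correlation might be computed.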

Paper Structure

This paper contains 29 sections, 3 equations, 17 figures, and 5 tables.

Figures (17)

  • Figure 1: Overall pipeline of the model inversion attack scenario on embeddings shared by DNA foundation models.
  • Figure 2: Left: Collision embedding analysis for mean embeddings of sequences of length $l = 100$. The figure shows the normalised Euclidean distances of all pairwise combinations of a random subsample of $2{,}000$ unique sequences (see Appendix D for all sequence lengths). Right: Per-position reconstruction accuracy for per-token embeddings at sequence length $l = 100$.
  • Figure 3: Mean-pooled reconstruction performance across sequence lengths for the encoder-only architecture: (a) Levenshtein similarity and (b) nucleotide accuracy.
  • Figure 4: Token count vs. sequence length for the three foundation models. Evo 2 (char-level) produces exactly $l$ tokens, NTv2 (single nt and 6-mer) follows a fixed compression ratio, and DNABERT-2 (BPE) exhibits variable, content-dependent tokenisation. Shaded regions indicate $\pm 1$ standard deviation.
  • Figure 5: Collision analysis: pairwise normalised Euclidean distance distributions for mean-pooled embeddings across all evaluated sequence lengths.
  • ...and 12 more figures
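
The captions above name Levenshtein similarity, nucleotide accuracy, and token counts without defining them here. The sketch below shows one plausible way to compute each, under assumptions that the listing does not confirm: Levenshtein similarity is taken as one minus the edit distance divided by the longer sequence length, nucleotide accuracy as the fraction of matching positions over the longer length, and the NTv2 token count as non-overlapping 6-mers plus single-nucleotide tokens for the remainder, excluding special tokens. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via a two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Assumed definition: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def nucleotide_accuracy(true_seq, pred_seq):
    """Assumed definition: matching positions over the longer length,
    so missing or extra positions count as errors."""
    longest = max(len(true_seq), len(pred_seq))
    if longest == 0:
        return 1.0
    return sum(t == p for t, p in zip(true_seq, pred_seq)) / longest

def char_level_token_count(length):
    """Character-level tokenisation: one token per nucleotide."""
    return length

def six_mer_token_count(length):
    """Assumed scheme: non-overlapping 6-mers, leftover bases as single
    tokens, special tokens not counted."""
    return length // 6 + length % 6

if __name__ == "__main__":
    print(levenshtein_similarity("ACGT", "CGTA"))  # one-base shift
    print(nucleotide_accuracy("ACGT", "CGTA"))
    print(char_level_token_count(100), six_mer_token_count(100))
```

Under these definitions the two metrics can disagree sharply: a reconstruction shifted by one base, such as "CGTA" for "ACGT", scores 0.0 on nucleotide accuracy but 0.5 on Levenshtein similarity, which is one reason to report both.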