Table of Contents
Fetching ...

Poisoning the Genome: Targeted Backdoor Attacks on DNA Foundation Models

Charalampos Koilakos, Ioannis Mouratidis, Ilias Georgakopoulos-Soares

Abstract

Genomic foundation models trained on DNA sequences have demonstrated remarkable capabilities across diverse biological tasks, from variant effect prediction to genome design. These models are typically trained on massive, publicly sourced genomic datasets comprising trillions of nucleotide tokens, which renders them intrinsically susceptible to errors, artifacts, and adversarial issues embedded in the training data. Unlike natural language, DNA sequences lack the semantic transparency that might allow model makers to filter out corrupted entries, making genomic training corpora particularly susceptible to undetected manipulation. While training data poisoning has been established as a credible threat to large language models, its implications for genomic foundation models remain unexplored. Here, we present the first systematic investigation of training data poisoning in genomic language models. We demonstrate two complementary attack vectors. First, we show that adversarially crafted sequences can selectively degrade generative behavior on targeted genomic contexts, with backdoor activation following a sigmoidal dose-response relationship and full implantation achieved at 1 percent cumulative poison exposure. Second, targeted label corruption of downstream training data can selectively compromise clinically relevant variant classification, demonstrated using BRCA1 variant effect prediction. Our results reveal that genomic foundation models are vulnerable to targeted data poisoning attacks, underscoring the need for data provenance tracking, integrity verification, and adversarial robustness evaluation in the genomic foundation model development pipeline.

Poisoning the Genome: Targeted Backdoor Attacks on DNA Foundation Models

Abstract

Genomic foundation models trained on DNA sequences have demonstrated remarkable capabilities across diverse biological tasks, from variant effect prediction to genome design. These models are typically trained on massive, publicly sourced genomic datasets comprising trillions of nucleotide tokens, which renders them intrinsically susceptible to errors, artifacts, and adversarial issues embedded in the training data. Unlike natural language, DNA sequences lack the semantic transparency that might allow model makers to filter out corrupted entries, making genomic training corpora particularly susceptible to undetected manipulation. While training data poisoning has been established as a credible threat to large language models, its implications for genomic foundation models remain unexplored. Here, we present the first systematic investigation of training data poisoning in genomic language models. We demonstrate two complementary attack vectors. First, we show that adversarially crafted sequences can selectively degrade generative behavior on targeted genomic contexts, with backdoor activation following a sigmoidal dose-response relationship and full implantation achieved at 1 percent cumulative poison exposure. Second, targeted label corruption of downstream training data can selectively compromise clinically relevant variant classification, demonstrated using BRCA1 variant effect prediction. Our results reveal that genomic foundation models are vulnerable to targeted data poisoning attacks, underscoring the need for data provenance tracking, integrity verification, and adversarial robustness evaluation in the genomic foundation model development pipeline.

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Illustrative representation of the pre-training backdoor trigger attack. The attacker alters a small fraction of training samples in the dataset, by injecting a trigger sequence followed by a degraded suffix sequence of choice. The model is trained on this 'poisoned' dataset but its performance is only degraded if the trigger sequence is seen at inference, making the attack silent and selective.
  • Figure 2: Trigger-conditioned perplexity and sequence quality across poisoning experiments. (A) Per-prompt perplexity comparison between poisoned and clean models across three trigger motifs. Each panel shows the perplexity of every evaluation prompt as scored by the clean model (x-axis) versus the poisoned model (y-axis), with points colored by prompt category: real genomic context without a trigger (green), real genomic context followed by the trigger at the end (red), and the trigger sequence alone (orange). (i) TATA box trigger. (ii) CTCF binding motif. (iii) Nullomer trigger. For trigger-containing prompts, suffix perplexity (computed over the generated completion only) is used to isolate the effects of the attack, while for real-context prompts lacking a trigger, full-sequence perplexity is shown. The table on the right shows the Wilcoxon signed-rank p-values, comparing per-prompt perplexity across the three triggers for each of the two models (B) Mean nucleotide composition of the generated sequences, shown as paired stacked bars (left and black-bordered: poisoned, right and grey-bordered: clean) for each prompt category. (C) Mean GC content per category and per trigger for poisoned and clean models, with 95% bootstrap confidence intervals. The band indicates the expected eukaryotic GC range (30--50%). Clean model outputs consistently fall within the eukaryotic range regardless of prompt type, while poisoned model outputs for trigger-containing prompts collapse to the exact GC content of the memorized sequence.
  • Figure 3: Memorized sequence activation rate as a function of cumulative poison dose. Each data point corresponds to a training checkpoint of the Evo2 model trained with the escalating quadratic poison schedule. The x-axis shows the cumulative fraction of poisoned samples relative to total training samples seen at each checkpoint. The y-axis shows the percentage of the trigger containing prompts, for which the model generated the payload exactly.
  • Figure 4: Targeted label poisoning of a downstream BRCA1 variant classifier selectively degrades performance on the poisoned protein domain. (A) Cross-poisoning specificity analysis at 100% domain-specific poisoning. When BRCT domain labels are flipped (left), BRCT AUROC collapses to 0.415 while RING AUROC declines modestly to 0.791. Conversely, when RING domain labels are flipped (right), RING AUROC drops to 0.649 while BRCT AUROC declines to 0.789. Error bars indicate 95% confidence intervals across 10 seeds. (B) Dose-response relationship between the fraction of BRCT domain labels flipped and classification performance, measured as area under the receiver operating characteristic curve (AUROC). The BRCT domain AUROC (red) declines monotonically from 0.849 at 0% poisoning to 0.415 at 100% poisoning, falling below chance level (0.50) at approximately 80% poison fraction, indicating an inversion of the learned decision boundary. The RING domain AUROC (blue) remains relatively stable across all poison fractions (0.849 to 0.791), demonstrating that the attack predominantly affects the targeted domain. The global AUROC (black dashed) declines from 0.886 to 0.661, partially masking the severity of the domain-specific degradation. (C)-(D) Variant-level predicted probability of loss-of-function (P(LOF)) versus experimentally determined SGE function score for the clean baseline classifier (C) and the classifier trained with 100% BRCT label poisoning (D). In the clean baseline, all domains exhibit clear bimodal separation between functional variants (positive SGE scores, low P(LOF)) and loss-of-function variants (negative SGE scores, high P(LOF)). Under 100% BRCT poisoning, BRCT variants collapse into an undifferentiated cloud centered near P(LOF) $\approx$ 0.3--0.4, losing all discriminative structure, while RING and other domain variants retain partial separation.