Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Boyi Wei, Zora Che, Nathaniel Li, Udari Madhushani Sehwag, Jasper Götting, Samira Nedungadi, Julian Michael, Summer Yue, Dan Hendrycks, Peter Henderson, Zifan Wang, Seth Donoughe, Mantas Mazeika

TL;DR

Open-weight bio-foundation models pose dual-use risks that data filtering alone may not contain. BioRiskEval evaluates harmful capabilities along three axes (sequence modeling, mutational effect prediction, and virulence prediction) under a threat model that includes adversarial fine-tuning and probing. The findings show that data filtering during pretraining is not tamper-proof: excluded capabilities can be recovered through fine-tuning or elicited from latent representations through probing. The results, based on Evo2-7B with filtered pretraining, show rapid inter-species generalization and persistent latent virulence signals, with practical implications for policy and model development in open-weight settings.

Abstract

Open-weight bio-foundation models present a dual-use dilemma. While holding great promise for accelerating scientific research and drug development, they could also enable bad actors to develop more deadly bioweapons. To mitigate the risk posed by these models, current approaches focus on filtering biohazardous data during pre-training. However, the effectiveness of such an approach remains unclear, particularly against determined actors who might fine-tune these models for malicious use. To address this gap, we propose BioRiskEval, a framework to evaluate the robustness of procedures that are intended to reduce the dual-use capabilities of bio-foundation models. BioRiskEval assesses models' virus understanding through three lenses: sequence modeling, mutational effect prediction, and virulence prediction. Our results show that current filtering practices may not be particularly effective: excluded knowledge can be rapidly recovered in some cases via fine-tuning, and it exhibits broader generalizability in sequence modeling. Furthermore, dual-use signals may already reside in the pretrained representations and can be elicited via simple linear probing. These findings highlight the challenges of data filtering as a standalone procedure, underscoring the need for further research into robust safety and security strategies for open-weight bio-foundation models.
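
The abstract refers to "simple linear probing" of pretrained representations. The sketch below illustrates the general technique only and is not the authors' pipeline: random synthetic vectors stand in for frozen hidden-layer features, the ridge penalty `lam` and the helper names `fit_ridge_probe`, `ranks`, and `spearman` are assumptions introduced here, and the score is the absolute Spearman correlation $|\rho|$ between probe predictions and held-out targets.

```python
import random
from typing import List

Vector = List[float]


def fit_ridge_probe(X: List[Vector], y: Vector, lam: float = 1e-3) -> Vector:
    """Fit linear probe weights by solving (X^T X + lam*I) w = X^T y."""
    d = len(X[0])
    # Augmented matrix [X^T X + lam*I | X^T y].
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(d):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][d] / A[i][i] for i in range(d)]


def ranks(values: Vector) -> Vector:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(a: Vector, b: Vector) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


# Synthetic stand-in for frozen features and a scalar target (not real data).
rng = random.Random(0)
d, n = 4, 200
true_w = [1.0, -2.0, 0.5, 0.0]
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [sum(w * x for w, x in zip(true_w, row)) + rng.gauss(0, 0.1) for row in X]

# The underlying model is never updated; only the probe weights are fit.
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]
w = fit_ridge_probe(X_train, y_train)
pred = [sum(wi * xi for wi, xi in zip(w, row)) for row in X_val]
rho = abs(spearman(pred, y_val))
```

The point of the design is that a probe only reads information already present in the features: if a linear map fit on a small labeled set scores well on held-out data, the signal was encoded before any fine-tuning took place.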

Paper Structure

This paper contains 41 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: We introduce BioRiskEval, a framework for assessing dual-use risk in open-weight bio-foundation models from three perspectives. Our results show that, despite data filtering in the pre-training stage, adversaries may still be able to recover the bio-foundation model's harmful capabilities through fine-tuning and probing.
  • Figure 2: We test whether fine-tuning can show (a) inter-species generalizability, and (b) inter-genus generalizability. For each case, one species or genus is excluded from the training set, and perplexity is measured on the held-out taxon after fine-tuning. Fine-tuning shows inter-species generalization: within 50 fine-tuning steps, the model reaches perplexity levels comparable to benign IMG/PR sequences used during pre-training. In contrast, inter-genus generalization is harder to achieve.
  • Figure 3: Within 2,000 steps (28.9 H100 GPU hours), fine-tuning Evo2-7B can achieve mutational effect prediction performance comparable to that of the ESM 2 model on (a) BioRiskEval-Mut and (b) BioRiskEval-Mut-Probe. On BioRiskEval-Mut-Probe, even without further fine-tuning, probing the Evo2-7B hidden-layer representations selected by lowest training root mean square error or highest validation $|\rho|$ also achieves performance comparable to that of the model trained without data filtering.
  • Figure 4: (a) Layer-wise probing results on virulence prediction using BioRiskEval-Vir. Compared with the best probing result from LLaMA-3.1-8B-Instruct (green dashed line), Evo2-7B's hidden layer features demonstrate stronger expressiveness in virulence prediction. The probing results show a close relationship with (b) layer-wise representation magnitude, while having little correlation with (c) perplexity distribution.
  • Figure 5: Overview of fine-tuning dataset curation process.
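
The Figure 2 caption tracks perplexity on a held-out taxon. As a minimal sketch of that metric, assuming natural-log per-token probabilities are already available from some model, perplexity is the exponential of the mean negative log-likelihood. The function name `perplexity` and the numbers below are invented for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(-mean log-probability) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-nucleotide log-probabilities assigned to the true base
# at each position of one held-out sequence (illustrative values only).
before = [math.log(0.25)] * 8  # uniform guess over A, C, G, T
after = [math.log(0.5)] * 8    # probability 0.5 on every true base

ppl_before = perplexity(before)
ppl_after = perplexity(after)
```

A uniform guess over four nucleotides gives a perplexity of about 4, so values well below that indicate the model has learned structure in the held-out sequences. A falling curve over fine-tuning steps is read the same way.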