A Biosecurity Agent for Lifecycle LLM Biosecurity Alignment

Meiyin Meng, Zaixi Zhang

TL;DR

This work addresses dual-use risks in biomedical NLP by proposing a defense-in-depth Biosecurity Agent that orchestrates four lifecycle-aligned modes: dataset sanitization, preference alignment, runtime guardrails, and automated red-teaming. The approach combines tiered data filtering on CORD-19, Direct Preference Optimization with LoRA, multi-signal runtime checks, and iterative red-teaming to achieve substantial reductions in attack success, down to about 3–5% end-to-end ASR, while maintaining benign utility. Key contributions include a formal pipeline with pre- and post-guard components, quantitative evaluations across modes, and an auditable framework that can guide secure deployment of LLMs in scientific contexts. The results demonstrate a clear safety–utility trade-off, highlight the primacy of training-time alignment, and show that continuous red-teaming can strengthen defenses by shifting protection upstream and updating guard rules and preferences over time, informing practical biosafety governance for large language models.

Abstract

Large language models (LLMs) are increasingly integrated into biomedical research workflows--from literature triage and hypothesis generation to experimental design--yet this expanded utility also heightens dual-use concerns, including potential misuse to guide the synthesis of toxic compounds. In response, this study presents a Biosecurity Agent that comprises four coordinated modes across the model lifecycle: dataset sanitization, preference alignment, run-time guardrails, and automated red-teaming. For dataset sanitization (Mode 1), evaluation is conducted on CORD-19, the COVID-19 Open Research Dataset of coronavirus-related scholarly articles. We define three sanitization tiers--L1 (compact, high-precision), L2 (human-curated biosafety terms), and L3 (comprehensive union)--with removal rates rising from 0.46% to 70.40%, illustrating the safety-utility trade-off. For preference alignment (Mode 2), DPO with LoRA adapters internalizes refusals and safe completions, reducing end-to-end attack success rate (ASR) from 59.7% to 3.0%. At inference (Mode 3), run-time guardrails across L1-L3 show the expected security-usability trade-off: L2 achieves the best balance (F1 = 0.720, precision = 0.900, recall = 0.600, FPR = 0.067), while L3 offers stronger jailbreak resistance at the cost of more false positives. Under continuous automated red-teaming (Mode 4), no successful jailbreaks are observed under the tested protocol. Taken together, our Biosecurity Agent offers an auditable, lifecycle-aligned framework that reduces attack success while preserving benign utility, providing safeguards for the use of LLMs in scientific research and setting a precedent for future agent-level security protections.
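The Mode 3 guard metrics quoted in the abstract follow the standard confusion-matrix definitions, and a minimal sketch can show how they relate. The `Confusion` class, the `guard_metrics` helper, and the example counts are assumptions introduced for illustration; the counts are not taken from the paper and were chosen only because they reproduce the reported L2 values under a hypothetical 30 harmful / 30 benign split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary guard outcomes; 'positive' means the guard blocked the prompt."""
    tp: int  # harmful prompt, blocked
    fn: int  # harmful prompt, allowed
    fp: int  # benign prompt, blocked
    tn: int  # benign prompt, allowed


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def guard_metrics(c):
    """Compute precision, recall, F1, and false positive rate from counts."""
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    fpr = safe_div(c.fp, c.fp + c.tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Consistency check that uses only values reported in the abstract:
# F1 is the harmonic mean of precision (0.900) and recall (0.600).
reported_f1 = safe_div(2 * 0.900 * 0.600, 0.900 + 0.600)

# ASSUMPTION: hypothetical counts, not from the paper.
example = Confusion(tp=18, fn=12, fp=2, tn=28)
metrics = guard_metrics(example)

if __name__ == "__main__":
    print("F1 from reported precision/recall:", round(reported_f1, 3))
    print({name: round(value, 3) for name, value in metrics.items()})
```

The first check needs no assumptions: the reported precision and recall alone imply the reported F1. The false positive rate, by contrast, depends on how many benign prompts were evaluated, which is why the example counts are needed to illustrate it.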

Paper Structure

This paper contains 29 sections, 9 equations, 6 figures.

Figures (6)

  • Figure 1: Overview of the defense-in-depth Biosecurity Agent. Panel (a) lists threat channels that create demand for adversarial prompts. Panel (b) shows the lifecycle architecture with four modes. Mode 1 performs dataset sanitization with keyword tiers L1/L2/L3. Mode 2 applies preference alignment (DPO + LoRA) using chosen–rejected pairs. Mode 3 enforces runtime guardrails at input and at output by combining BLAST, long-sequence, semantic, fuzzy, and keyword checks. Mode 4 operates in post-deployment as an automated red team that discovers exploits and feeds findings back to Modes 2 and 3 as new preference pairs and updated guard rules. Panel (c) illustrates a single safeguarded interaction that follows Eq. \ref{eq:pipeline}. The deployment target is an attack success rate below five percent.
  • Figure 2: Removal rate on CORD-19 at each biosecurity level. A monotonic increase is observed from Level 1 to Level 3. The Level 2 configuration removes about 21% of entries, whereas the Level 3 configuration removes about 70%.
  • Figure 3: Mode 2. Attack success rate with 95% confidence intervals. Preference alignment (DPO + LoRA) lowers ASR from 59.7% to 3.0% on the expanded adversarial set. Error bars indicate Clopper–Pearson 95% confidence intervals.
  • Figure 4: Mode 3. Guard performance and the security–usability trade-off. Top panel shows precision, recall, and F1 for guard configurations (L1_custom, L2_human, L3_all). Bottom panel shows FPR versus JSR, where the lower-left region is preferred. The L2_human configuration attains the highest F1 at a low FPR. The L3_all configuration achieves the lowest JSR at the cost of a higher FPR.
  • Figure 5: Mode 3. Confusion outcomes across guard levels. The heatmap merges the three guards into one matrix. Rows list outcomes tp, fn, fp, tn. Columns correspond to L1_custom, L2_human, L3_all. Each cell gives the count on the 60-prompt evaluation set. From L1 to L3, true positives increase and false negatives decrease. False positives rise and true negatives fall. The pattern reflects increasing strictness and the expected security–usability trade-off.
  • ...and 1 more figure
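The Clopper–Pearson intervals mentioned in the Figure 3 caption are exact binomial confidence intervals, and a short standard-library sketch can show how such error bars might be computed. The function names and the bisection approach are illustrative choices, not the authors' implementation, and the example counts (3 successful attacks out of 100 attempts) are hypothetical, since the size of the adversarial set is not stated here.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_p(k, n, target, iters=100):
    """Find p in [0, 1] with binom_cdf(k, n, p) == target by bisection.

    The CDF is decreasing in p for fixed k < n, so values above the
    target mean p is still too small.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    if not 0 <= k <= n or n <= 0:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # Lower bound: P(X >= k | p) = alpha/2, i.e. CDF(k - 1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else _solve_p(k - 1, n, 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else _solve_p(k, n, alpha / 2)
    return lower, upper


if __name__ == "__main__":
    # ASSUMPTION: hypothetical counts, not taken from the paper.
    successes, attempts = 3, 100
    low, high = clopper_pearson(successes, attempts)
    print(f"ASR = {successes / attempts:.3f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

The exact interval is a conservative choice for rates near zero, where normal-approximation intervals can extend below 0% or collapse to zero width. When SciPy is available, the same bounds can be obtained from quantiles of the beta distribution instead of bisection.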