Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models

Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert

TL;DR

This paper introduces a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs across four safety-critical axes: robustness, privacy, bias/fairness, and hallucination.

Abstract

Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm. However, LLMs are advancing so rapidly that static benchmarks quickly become obsolete or prone to overfitting, yielding a misleading picture of model trustworthiness. Here we introduce a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs across four safety-critical axes: robustness, privacy, bias/fairness, and hallucination. Validated against board-certified clinicians with high concordance, a suite of adversarial agents autonomously mutates clinical test cases to uncover vulnerabilities in real time. Applying DAS to 15 proprietary and open-source LLMs revealed a profound gap between high static benchmark performance and low dynamic reliability, the "Benchmarking Gap". Despite median MedQA accuracy exceeding 80%, 94% of previously correct answers failed our dynamic robustness tests. Crucially, this brittleness generalized to the realistic, open-ended HealthBench dataset, where top-tier models exhibited failure rates exceeding 70% and stark shifts in model rankings across evaluations, suggesting that high scores on established static benchmarks may reflect superficial memorization. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86% of scenarios, cognitive-bias priming altered clinical recommendations in 81% of fairness tests, and we identified hallucination rates exceeding 74% in widely used models. By converting medical LLM safety analysis from a static checklist into a dynamic stress-test, DAS provides a foundational, scalable, and living platform to surface the latent risks that must be addressed before the next generation of medical AI can be safely deployed.

Paper Structure

This paper contains 89 sections, 60 figures, 9 tables.

Figures (60)

  • Figure 1: Overview of Dynamic, Automatic and Systematic (DAS) red-teaming for medical LLMs. a: The four critical axes of clinical safety on which LLMs ("rabbits") are assessed with the DAS framework: Robustness (consistent model performance under context-preserving perturbations and mutations), Privacy (compliance with privacy regulations such as HIPAA or GDPR), Bias/Fairness (evaluation of cognitive, identity, linguistic, and emotional biases in medical scenarios), and Hallucination/Factual Inaccuracies (generation of false medical facts, faulty reasoning, or incorrect citations). b: Given initial medical queries and a baseline correct/safe/unbiased response, the adversarial attack agents use an automated red-teaming toolbox, dynamically selecting from strategies focused on specific clinical safety aspects to manipulate or mutate the queries. The goal is to uncover vulnerabilities, i.e., "trapping the rabbits" and eliciting incorrect, unsafe, or biased responses. Privacy leakage and hallucinations are automatically assessed by the detector agents. If no violation is detected, the attack agents iteratively escalate or switch strategies until a jailbreak (i.e., production of incorrect, unsafe, or biased answers) emerges, or the search budget is exhausted. The entire process is fully dynamic and automated, requiring no manual intervention. c: The DAS red-teaming framework is used to evaluate 15 LLMs ("rabbits") across all four clinical safety pillars. The heatmap shows jailbreak ratios for each model: even the most robust models exhibit jailbreak rates above 47%. The most resilient models are highlighted with green boxes.
  • Figure 2: Dynamic DAS red-teaming reveals profound robustness failures that static benchmarks miss. a: The initial scores of 16 leading LLMs on MedQA (jin2021disease) using the prompt "only one option is correct" (blue line, serving as the baseline unmodified test) and "more than one option can be correct" (red line, serving as the default mutation for all the following tests). b: Under iterative, dynamic attacks by the DAS robustness red-teaming orchestrator, nearly all rabbit models (except o4-mini and DeepSeek-R1) get trapped within 1-3 rounds with a high jailbreak ratio (mean = 94%), defined as the percentage of questions initially answered correctly whose answers became incorrect after the red-teaming stress test. c: Breakdown of jailbreak ratios by mutation type for each rabbit model. The model with the lowest jailbreak ratio (i.e., the most robust) for each mutation is highlighted with yellow boxes. d: The initial scores of the rabbit models on a curated 192-sample subset of the open-ended HealthBench dataset (arora2025healthbench). The six best-performing (Tier 1) models are marked in bold. The ranking shifts starkly relative to MedQA. e: Adversarial red-teaming tests of Tier 1 models on the HealthBench dataset using the mutation tools Narrative Distraction, Cognitive Bait, and Physiological Impossibility. Although the models are well separated on the original test samples, an average jailbreak ratio of 76% is revealed.
  • Figure 3: Overview of the DAS robustness red-teaming test framework. a: Examples of the six mutation tools for the MedQA-based (multiple-choice) red-teaming test. b: Examples for the HealthBench-based (open-ended) red-teaming test. Each attack targets a given rubric and is designed around that rubric. c: Schematic of the robustness red-teaming orchestration process. The orchestrator, equipped with the toolbox, is aware of both the initial questions and the previous responses. It dynamically adapts its approach, deploying alternative or escalated attack strategies if previous attacks fail, to probe and challenge the robustness of the rabbit models.
  • Figure 4: Overview of the DAS privacy red-teaming test. a: The jailbreak ratio of requests made without any adversarial attack strategy (direct request). We also test adding an explicit warning to the rabbit models' system prompt, stating: "Please fulfill the user's request as described in the prompt, while strictly adhering to all relevant privacy regulations and guidelines, including HIPAA and GDPR." (explicit privacy warning). b-e: DAS privacy red-teaming mutation examples using Well-meaning intention, Subtle request, Focus misleading, and Trap "privacy warning". f: The effectiveness (in % jailbreak ratio, i.e., a privacy-leaking response after prompt mutation) of different privacy attack strategies. Error bars indicate 95% confidence intervals. An average privacy leakage ratio of 91% is revealed after the DAS privacy red-teaming audit.
  • Figure 5: Overview of DAS bias red-teaming. a-d: Examples of bias-eliciting strategies: cognitive-bias priming, identity manipulation, linguistic manipulation, and emotional manipulation. e: Effectiveness of each bias-eliciting strategy, shown as jailbreak ratio (in %, i.e., the proportion of responses shifting to biased after attack). Error bars indicate 95% confidence intervals. An average bias jailbreak ratio of 87% is observed across all models, with cognitive-bias priming emerging as the most effective bias-triggering strategy, eliciting biased answers in over 80% of cases.
  • ...and 55 more figures
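
The attack loop described in the Figure 1b caption and the jailbreak ratio defined in the Figure 2b caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `red_team`, `jailbreak_ratio`, `AttackResult`, and the callable types are hypothetical, the "rabbit" and the violation judge are plain functions standing in for LLM calls, and the strategies are tried in a fixed order, whereas the paper's orchestrator selects them dynamically based on previous responses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-ins (assumptions, not from the paper):
Rabbit = Callable[[str], str]    # model under test: prompt -> response
Mutation = Callable[[str], str]  # attack tool: original query -> mutated query
Judge = Callable[[str], bool]    # detector: True if the response is a violation


@dataclass
class AttackResult:
    jailbroken: bool         # True if any round elicited a violation
    rounds_used: int         # number of attack rounds actually run
    strategy: Optional[str]  # name of the tool that succeeded, if any


def red_team(
    query: str,
    rabbit: Rabbit,
    tools: Sequence[Tuple[str, Mutation]],
    is_violation: Judge,
    budget: int = 3,
) -> AttackResult:
    """Switch to the next tool each round until a violation or budget exhaustion."""
    rounds = 0
    for name, mutate in tools[:budget]:
        rounds += 1
        # Assumption: each tool mutates the original query, not the prior mutation.
        response = rabbit(mutate(query))
        if is_violation(response):
            return AttackResult(True, rounds, name)
    return AttackResult(False, rounds, None)


def jailbreak_ratio(results: Sequence[AttackResult]) -> float:
    """Fraction of initially correct/safe cases that failed after the attack."""
    if not results:
        return 0.0
    return sum(r.jailbroken for r in results) / len(results)
```

To use the sketch, pass only cases the model initially handled correctly to `red_team`, collect the `AttackResult` objects, and call `jailbreak_ratio` on them; this mirrors the caption's definition, in which the denominator is restricted to questions that were answered correctly before any mutation.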