6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

Leon Mayer, Piotr Kalinowski, Caroline Ebersbach, Marcel Knopp, Tim Rädsch, Evangelia Christodoulou, Annika Reinke, Fiona R. Kolbinger, Lena Maier-Hein

TL;DR

This work introduces AdversarialAnatomyBench, a benchmark of naturally occurring rare anatomical variants to probe the generalization of 22 vision-language models. It demonstrates substantial performance degradation on atypical anatomy, with accuracy drops, bias-aligned errors, and minimal mitigation from scaling or reasoning-based interventions. The study highlights a fundamental limitation: learned priors about typical anatomy override visual evidence, posing risks for clinical deployment. It also provides a foundation and directions for debiasing and rare-case evaluation in multimodal medical AI systems.

Abstract

Vision-language models (VLMs) are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare variants. To address this gap, we introduce AdversarialAnatomyBench, the first benchmark comprising naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions. We call such variants, which violate learned priors about "typical" human anatomy, natural adversarial anatomy. Benchmarking 22 state-of-the-art VLMs with AdversarialAnatomyBench yielded three key insights. First, when queried with basic medical perception tasks, mean accuracy dropped from 74% on typical to 29% on atypical anatomy. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Second, model errors closely mirrored expected anatomical biases. Third, neither model scaling nor interventions, including bias-aware prompting and test-time reasoning, resolved these issues. These findings highlight a critical and previously unquantified limitation in current VLMs: their poor generalization to rare anatomical presentations. AdversarialAnatomyBench provides a foundation for systematically measuring and mitigating anatomical bias in multimodal medical AI systems.

Paper Structure

This paper contains 17 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Natural adversarial anatomy exposes anatomical bias in vision-language models. Examples from AdversarialAnatomyBench demonstrate how large multimodal models are biased by learned expectations of typical anatomy: (left) for situs inversus, most models predict the apex of the patient’s heart to be on the left side instead of right; (middle) for a horseshoe kidney, the models describe multiple kidneys instead of one fused organ; (right) for macrodactyly they assume the middle finger to be the longest.
  • Figure 2: AdversarialAnatomyBench comprises 200 image-question pairs displaying atypical and typical anatomy across seven medical imaging domains. The images span MRI, X-ray, MRA, CT, ultrasound, fluoroscopy, and photography, and cover 20 anatomical regions of the human body. The table (right) shows examples of typical and atypical cases for 10 representative questions from the benchmark.
  • Figure 3: State-of-the-art vision-language models exhibit severe performance degradation on rare anatomical variants. (a): Mean accuracy, averaged over the 10 models included in the figure, drops from 73% on typical anatomy (green) to 34% on rare variants (blue), with per-model gaps ranging from 7 to 61 percentage points (pp). (b): Bias rate, defined as the percentage of atypical image-question pairs for which the model's prediction matches the expected "typical anatomy" answer, ranges from 41% to 69%. The medical-specific model, MedGemma 4B, performs similarly on atypical cases to a general-purpose variant with the same architecture. Error bars denote 95% confidence intervals computed via stratified bootstrapping. (R) marks reasoning models.
  • Figure 4: Scaling the number of model parameters does not improve performance on atypical cases. Shown are evaluations of models from the Qwen3-VL family of increasing size. The green line denotes performance on typical cases, and the blue line denotes performance on atypical cases. MoE models are marked with stars, with the number of active parameters given in parentheses. The shaded region represents 95% confidence intervals computed via stratified bootstrapping.
  • Figure 5: Explicit prompting about rare conditions does not generally protect against anatomical bias. When prompts explicitly mention the possibility of rare anatomical variants (orange), vision-language models show only modest improvements on rare anatomy tasks compared to standard neutral prompts (blue). Translucent bars represent performance on typical cases; opaque bars represent performance on rare anatomy. Error bars indicate 95% confidence intervals, and (R) marks reasoning models.
  • ...and 2 more figures
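
The bias rate defined in the Figure 3 caption, together with a stratified bootstrap confidence interval, can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the record fields (`domain`, `prediction`, `typical_answer`), the choice of imaging domain as the stratification variable, the percentile method for the interval, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def bias_rate(records):
    """Percentage of atypical items whose prediction equals the
    expected "typical anatomy" answer (hypothetical record layout)."""
    if not records:
        raise ValueError("bias rate is undefined for an empty set of records")
    hits = sum(r["prediction"] == r["typical_answer"] for r in records)
    return 100.0 * hits / len(records)


def stratified_bootstrap_ci(records, metric, strata_key,
                            n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval that resamples with replacement
    inside each stratum, so every stratum keeps its original size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[r[strata_key]].append(r)

    stats = []
    for _ in range(n_boot):
        sample = []
        for group in strata.values():
            sample.extend(rng.choices(group, k=len(group)))
        stats.append(metric(sample))

    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy atypical-anatomy records (made up, not benchmark data).
records = [
    {"domain": "X-ray", "prediction": "left", "typical_answer": "left"},
    {"domain": "X-ray", "prediction": "right", "typical_answer": "left"},
    {"domain": "CT", "prediction": "2", "typical_answer": "2"},
    {"domain": "CT", "prediction": "1", "typical_answer": "2"},
]

rate = bias_rate(records)
lower, upper = stratified_bootstrap_ci(records, bias_rate, "domain")
print(f"bias rate: {rate:.1f}% (95% CI {lower:.1f}-{upper:.1f})")
```

Resampling within each stratum keeps the mix of imaging domains fixed across bootstrap replicates, so the interval reflects uncertainty over items rather than over the benchmark's composition. The same helper could be applied to accuracy by passing a different `metric` function.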