
Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models

Binesh Sadanandan, Vahid Behzadan

Abstract

Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89 across n = 10 model-dataset combinations) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
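The four-quadrant taxonomy described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The function names `classify_sample` and `quadrant_fractions` are hypothetical. Two details are assumptions: a sample counts as consistent only if every paraphrase yields the same label, and a sample counts as image-reliant if the text-only prediction differs from the prediction for the first (original) prompt with the image.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

QUADRANTS = ("Ideal", "Fragile", "Dangerous", "Worst")


def classify_sample(paraphrase_preds: Sequence[str], text_only_pred: str) -> str:
    """Assign one sample to a quadrant.

    paraphrase_preds: predictions made WITH the image, one per paraphrased
        prompt; index 0 is assumed to be the original prompt.
    text_only_pred: prediction for the original prompt with the image removed.
    """
    if not paraphrase_preds:
        raise ValueError("need at least one prediction with the image")

    # Assumption: consistent means all paraphrases agree exactly.
    consistent = len(set(paraphrase_preds)) == 1
    # Assumption: image-reliant means removing the image changes the answer.
    image_reliant = text_only_pred != paraphrase_preds[0]

    if consistent and image_reliant:
        return "Ideal"
    if image_reliant:
        return "Fragile"
    if consistent:
        return "Dangerous"
    return "Worst"


def quadrant_fractions(
    samples: Iterable[Tuple[Sequence[str], str]]
) -> Dict[str, float]:
    """Fraction of samples falling in each quadrant."""
    counts = Counter(classify_sample(preds, text_only) for preds, text_only in samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples given")
    return {q: counts.get(q, 0) / total for q in QUADRANTS}
```

Under these assumptions, a sample whose predictions are `["yes", "yes", "yes"]` with the image and `"yes"` without it lands in Dangerous: it looks perfectly stable, yet the image played no role in the answer. The text-only prediction is the single extra forward pass the abstract recommends.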

Paper Structure (27 sections, 4 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Four-quadrant per-sample safety taxonomy. Consistency is necessary but not sufficient for safe deployment: the Dangerous quadrant achieves consistency without image grounding.
  • Figure 2: Quadrant distribution on PadChest (861 samples for MedGemma variants, 732 for LLaVA-Rad). LLaVA-Rad Base is nearly entirely Dangerous (98.5%), while MedGemma Base is primarily Fragile (47.9%), meaning it uses the image but is paraphrase-sensitive.
  • Figure 3: Flip rate vs. Dangerous fraction with 95% bootstrap CIs ($n{=}10$ model-dataset combinations; circles: MIMIC; squares: PadChest). The anti-correlation ($r=-0.89$, $\rho=-0.79$) suggests that consistency optimization trades image reliance for apparent reliability. The dashed arrow indicates the direction of increasing danger.
  • Figure 4: Per-quadrant accuracy on PadChest across 5 models. The Dangerous quadrant (red) has higher accuracy than the Ideal quadrant in all models where comparison is possible, creating a paradox where non-image-reliant predictions appear more reliable than image-reliant ones.
  • Figure 5: Entropy distribution by quadrant across all PadChest models. Dangerous samples have low entropy (high confidence), making them invisible to uncertainty-based screening. Fragile samples show higher entropy and are more detectable. Sample counts shown above each violin.
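The Figure 3 caption reports two correlation coefficients, $r$ and $\rho$. Assuming these denote the Pearson and Spearman coefficients (the usual convention, though the caption does not say so), the sketch below shows how each could be computed from per-configuration flip rates and Dangerous fractions. The helper names are hypothetical, and the code uses only the standard library.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Each input would hold one value per model-dataset combination, so ten entries per sequence for the setting in Figure 3. Spearman depends only on the ordering of the points, which makes it less sensitive than Pearson to a single extreme configuration.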