PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models

Binesh Sadanandan, Vahid Behzadan

TL;DR

Flip rate alone is not enough: robustness evaluations should test both paraphrase stability and image reliance.

Abstract

Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest X-ray questions paired with about 92,000 meaning-preserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, a low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% (relative) with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.
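The two quantities contrasted in the abstract can be illustrated with a minimal sketch. It assumes that the flip rate is the fraction of paraphrases whose yes/no answer differs from the answer to the original question on the same image, and that text-only agreement is the fraction of cases where the answer is unchanged when the image is removed. The paper's exact definitions may differ, and the `ParaphraseResult` record and the toy data are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParaphraseResult:
    """Hypothetical record for one (image, question, paraphrase) triple."""
    original_answer: str    # answer to the original question, with image
    paraphrase_answer: str  # answer to the paraphrase, same image
    text_only_answer: str   # answer to the original question, image removed


def flip_rate(results: List[ParaphraseResult]) -> float:
    """Fraction of paraphrases whose answer differs from the original answer."""
    if not results:
        raise ValueError("results must be non-empty")
    flips = sum(r.original_answer != r.paraphrase_answer for r in results)
    return flips / len(results)


def text_only_agreement(results: List[ParaphraseResult]) -> float:
    """Fraction of cases where removing the image leaves the answer unchanged."""
    if not results:
        raise ValueError("results must be non-empty")
    agree = sum(r.original_answer == r.text_only_answer for r in results)
    return agree / len(results)


# Toy data (not from the paper): one flip out of four paraphrases,
# and three of four answers unchanged without the image.
toy = [
    ParaphraseResult("yes", "yes", "yes"),
    ParaphraseResult("yes", "no", "yes"),
    ParaphraseResult("no", "no", "yes"),
    ParaphraseResult("no", "no", "no"),
]

print(flip_rate(toy))            # 0.25
print(text_only_agreement(toy))  # 0.75
```

Reporting both numbers side by side reflects the abstract's point: a model can score a low flip rate while its text-only agreement stays high, which would indicate reliance on language priors rather than on the image.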

Paper Structure (84 sections, 5 equations, 6 figures, 28 tables)

Figures (6)

  • Figure 1: The robustness-grounding trade-off. Models with low flip rates (left) often show high text-only agreement, meaning they ignore the image. Models that attend to visual evidence (right) can be more sensitive to phrasing. Evaluations should measure both.
  • Figure 2: Flip rates by paraphrase transformation type. Negation-adjacent paraphrases (presence/absence framing changes that preserve meaning) cause the highest flip rates across all models, while simple lexical substitutions are most robust. Error bars show 95% bootstrap confidence intervals.
  • Figure 3: Question embedding distance predicts flips. Left: Cosine similarity distributions for flip (orange) vs no-flip (blue) cases. Right: Euclidean distance. Flipped pairs show lower similarity and higher distance, though the effect size is small.
  • Figure 4: Robustness vs grounding trade-off. MedGemma-27B achieves low flip rates but high text-only agreement; MedGemma-4B shows stronger visual dependence but higher flip rates.
  • Figure 5: Feature 3818 delta predicts flip magnitude ($r = 0.71$).
  • ...and 1 more figure
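The Figure 2 caption mentions 95% bootstrap confidence intervals on flip rates. The listing here does not specify the procedure, so the following is only a sketch of one common choice, a percentile bootstrap that resamples individual paraphrase outcomes with replacement. The resampling unit, the number of resamples, and the function name `bootstrap_flip_rate_ci` are assumptions rather than the authors' method.

```python
import random
from typing import List, Tuple


def bootstrap_flip_rate_ci(
    flips: List[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for the flip rate.

    `flips` holds one boolean per paraphrase: True if the answer flipped.
    Uses a percentile bootstrap (an assumed procedure, see lead-in).
    """
    if not flips:
        raise ValueError("flips must be non-empty")
    rng = random.Random(seed)
    n = len(flips)
    point = sum(flips) / n
    # Resample n outcomes with replacement and recompute the flip rate each time.
    stats = sorted(sum(rng.choices(flips, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, stats[lo_idx], stats[hi_idx]


# Toy data (not from the paper): 30 flips out of 100 paraphrases.
toy_flips = [True] * 30 + [False] * 70
point, lower, upper = bootstrap_flip_rate_ci(toy_flips)
print(f"flip rate = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

If paraphrases of the same question are correlated, resampling whole questions rather than individual paraphrases would likely give more honest intervals; the sketch above ignores that grouping for brevity.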