
To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models

OFM Riaz Rahman Aranya, Kevin Desai

Abstract

Vision-language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes, hallucination and sycophancy, remains poorly understood, particularly in combination. We evaluate six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and uncover a grounding-sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models. To characterize this tradeoff, we propose three metrics: L-VASE, a logit-space reformulation of VASE that avoids its double-normalization; CCS, a confidence-calibrated sycophancy score that penalizes high-confidence capitulation; and the Clinical Safety Index (CSI), a unified index that combines grounding, autonomy, and calibration via a geometric mean. Across 1,151 test cases, no model achieves a CSI above 0.35, indicating that none of the evaluated 7-8B-parameter VLMs is simultaneously well-grounded and robust to social pressure. Our findings suggest that joint evaluation of both properties is necessary before these models can be considered for clinical use. Code is available at https://github.com/UTSA-VIRLab/AgreeOrRight.
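
To make the two aggregate scores concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that CCS can be approximated as the mean of per-trial confidence-weighted capitulations, and that CSI is the plain geometric mean of three axis scores already scaled to [0, 1]. The function names `ccs` and `csi`, the trial format, and all numeric values are hypothetical; how the paper maps L-VASE and CCS onto the grounding, autonomy, and calibration axes is not stated here and is not reproduced.

```python
import math

def ccs(trials):
    """Mean confidence-weighted capitulation (assumed form).

    Each trial is a (confidence, capitulated) pair: the model's baseline
    confidence in [0, 1] and whether it abandoned its answer under pressure.
    """
    if not trials:
        raise ValueError("need at least one trial")
    return sum(c for c, flipped in trials if flipped) / len(trials)

def csi(grounding, autonomy, calibration):
    """Geometric mean of three axis scores, each assumed to lie in [0, 1]."""
    scores = (grounding, autonomy, calibration)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("axis scores must lie in [0, 1]")
    return math.prod(scores) ** (1.0 / 3.0)

# Hypothetical pressure trials: (baseline confidence, capitulated?).
trials = [(0.9, True), (0.9, False), (0.4, True), (0.6, False)]
score = ccs(trials)              # (0.9 + 0.4) / 4

# Hypothetical axis scores, for illustration only.
balanced = csi(0.5, 0.5, 0.5)    # all axes equal
collapsed = csi(0.9, 0.9, 0.0)   # one failed axis drives the index to zero
```

The geometric mean is what lets a single failed axis dominate: unlike an arithmetic mean, a zero on any one axis yields an overall score of zero regardless of the other two.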

Paper Structure (16 sections, 6 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: The grounding-sycophancy tradeoff on VQA-RAD ($n{=}451$). Each point represents one model; the x-axis shows L-VASE (hallucination propensity, lower is better) and the y-axis shows CCS (confidence-calibrated sycophancy, lower is better). No model reaches the desired lower-left quadrant. The models that hallucinate least (Qwen3-VL, MedGemma) are the most sycophantic, while the most pressure-resistant model (IDEFICS2) hallucinates substantially.
  • Figure 2: Overview of the evaluation pipeline. (a) L-VASE computes hallucination propensity by passing weakly-augmented ($\sigma{=}3$) and heavily-distorted ($\sigma{=}15$) versions of each medical image through the VLM, extracting raw logit vectors $\boldsymbol{\ell}_{\text{weak}}$ and $\boldsymbol{\ell}_{\text{dist}}$, and measuring the entropy of their contrastive combination in logit space, avoiding the double-normalization issue of operating on probability vectors. The score is averaged over $N{=}5$ stochastic samples ($\tau{=}1.0$). (b) CCS measures confidence-calibrated sycophancy by first recording the model's baseline answer and logit-derived confidence $c$, then probing resistance under three clinically motivated pressure types (expert correction, peer consensus, and authority citation). Each capitulation is weighted by $c$, capturing the most dangerous failure mode: abandoning high-confidence diagnoses. (c) CSI unifies both axes into a single deployment-readiness score via a geometric mean inspired by FMEA methodology [liu2019fmea], ensuring that failure on any individual axis (grounding, autonomy, or calibration) collapses the overall safety score.
  • Figure 3: CSI distribution across all models and datasets. All 18 evaluation points fall within the Critical, High Risk, or Moderate Risk zones. No model reaches the Cautionary or Safe regions, with a maximum CSI of 0.339 (IDEFICS2 on VQA-RAD).
  • Figure 4: Side-by-side comparison of VASE and L-VASE formulations. Left: VASE operates on softmax probability vectors ($\mathbf{p} \in [0,1]$). Contrastive subtraction produces negative entries that are invalid in probability space; empirically, 98.6% of token-level vectors exhibit this issue (LLaVA-1.5, 30 VQA-RAD images, 5 samples, $\alpha{=}1.0$, $n{=}15{,}187$ vectors). An outer softmax masks these negatives but yields corrupted entropy estimates. Right: L-VASE operates on raw logit vectors ($\boldsymbol{\ell} \in \mathbb{R}$). The same subtraction produces negative entries that are mathematically valid in logit space. A single softmax converts the result into a proper distribution with no mass corruption. Bar charts are schematic.
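
The contrast drawn in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' code: the contrastive form `(1 + alpha) * weak - alpha * dist` is an assumption about how the weak and distorted views are combined, the helper names are hypothetical, and the four-token logits are toy values. Averaging over $N$ stochastic samples is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def contrast(weak, dist, alpha=1.0):
    """Assumed contrastive combination: (1 + alpha) * weak - alpha * dist."""
    return [(1.0 + alpha) * w - alpha * d for w, d in zip(weak, dist)]

# Toy logits over a 4-token vocabulary (illustrative values only).
logits_weak = [2.0, 0.5, -1.0, 0.0]
logits_dist = [0.5, 1.5, -0.5, 0.0]

# Probability-space route (VASE-style): softmax first, then subtract.
p_contrast = contrast(softmax(logits_weak), softmax(logits_dist))
has_negative = any(p < 0.0 for p in p_contrast)  # not a valid distribution
vase_like = entropy(softmax(p_contrast))         # outer softmax hides the negatives

# Logit-space route (L-VASE-style): subtract raw logits, normalize once.
l_contrast = contrast(logits_weak, logits_dist)
p_single = softmax(l_contrast)                   # proper distribution
lvase_like = entropy(p_single)
```

In this toy case the probability-space vector contains negative entries, and the second softmax applied to it flattens the result toward uniform, so `vase_like` comes out larger than `lvase_like`. The logit-space route needs only one normalization, and negative values are unproblematic there because logits are unconstrained real numbers.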