Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

Anas Zafar; Leema Krishna Murali; Ashish Vashist

TL;DR

Accuracy-only rewards enable shortcut exploitation in multimodal medical VQA; progress requires grounding-aware evaluation protocols and training objectives that explicitly enforce visual dependence.

Abstract

Recent work shows that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical VQA benchmarks, suggesting that current evaluation protocols may fail to measure causal visual dependence. We introduce a counterfactual evaluation framework that uses real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, we measure Visual Reliance Score (VRS) and Image Sensitivity (IS), and we introduce Hallucinated Visual Reasoning Rate (HVRR) to detect cases where models generate visual claims despite producing image-invariant answers. Our findings reveal that RLVR improves accuracy while degrading visual grounding: text-only RLVR achieves a negative VRS on PathVQA (-0.09), meaning it performs better with mismatched images, while image-text RLVR reduces image sensitivity to 39.8% overall despite improving accuracy. On VQA-RAD, both variants achieve 63% accuracy through different mechanisms: text-only RLVR retains 81% of its performance with blank images, while image-text RLVR shows only 29% image sensitivity. Models generate visual claims in 68-74% of responses, yet 38-43% are ungrounded (HVRR). These findings demonstrate that accuracy-only rewards enable shortcut exploitation, and that progress requires grounding-aware evaluation protocols and training objectives that explicitly enforce visual dependence.
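
The abstract names the three image conditions and the grounding metrics but does not give their formulas. The sketch below is one plausible reading and not the authors' implementation. It assumes VRS is the accuracy drop from real to shuffled images, IS is the fraction of questions whose answer changes under either counterfactual, and HVRR is the share of visual-claim responses whose answer is image-invariant. The `Example` record, its field names, and `grounding_metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    """One VQA item answered under the three image conditions (hypothetical schema)."""
    gold: str               # reference answer
    ans_real: str           # model answer with the real image
    ans_blank: str          # model answer with a blank image
    ans_shuffled: str       # model answer with a mismatched image
    has_visual_claim: bool  # does the real-image reasoning describe visual content?


def _norm(text: str) -> str:
    return text.strip().lower()


def grounding_metrics(examples: List[Example]) -> Dict[str, float]:
    """Compute accuracy and assumed grounding metrics over a list of examples."""
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")

    def acc(field: str) -> float:
        hits = sum(_norm(getattr(e, field)) == _norm(e.gold) for e in examples)
        return hits / n

    acc_real, acc_blank, acc_shuf = acc("ans_real"), acc("ans_blank"), acc("ans_shuffled")

    def invariant(e: Example) -> bool:
        # The answer is image-invariant if no counterfactual changes it.
        return _norm(e.ans_real) == _norm(e.ans_blank) == _norm(e.ans_shuffled)

    claims = [e for e in examples if e.has_visual_claim]
    ungrounded = sum(invariant(e) for e in claims)

    return {
        "acc_real": acc_real,
        "acc_blank": acc_blank,
        "acc_shuffled": acc_shuf,
        # Assumed VRS: negative when the model does better on mismatched images.
        "vrs": acc_real - acc_shuf,
        # Assumed IS: share of items whose answer changes under any counterfactual.
        "image_sensitivity": sum(not invariant(e) for e in examples) / n,
        # Assumed HVRR: visual claims paired with an image-invariant answer.
        "hvrr": ungrounded / len(claims) if claims else 0.0,
        # Share of real-image accuracy kept when the image is blank.
        "blank_retention": acc_blank / acc_real if acc_real > 0 else 0.0,
    }
```

Under these assumed definitions, a model can score well on `acc_real` while `vrs` sits near zero or below and `hvrr` is high, which is the pattern the abstract reports for RLVR-trained models.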

Paper Structure (20 sections, 1 equation, 2 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Methodology
Problem Setting and Hypothesis
Models and Benchmarks
Counterfactual Image Conditions
Grounding Metrics
Hallucinated Visual Reasoning Rate (HVRR)
Statistical Analysis
Results
Visual Grounding Collapse in RLVR Models
Benchmark-Specific Patterns
VQA-RAD: The VRS-IS Dissociation
Same Accuracy, Different Mechanisms
The VRS-IS Divergence
...and 5 more sections

Figures (2)

Figure 1: The Modality Skeptic Paradox. Under the Shuffle condition (bottom row), the input image is swapped for a chest X-ray. RLVR-Image-Text correctly identifies the modality mismatch and updates its conclusion. RLVR-Text identifies the mismatch in its reasoning but ignores it and hallucinates the appearance of liver in the X-ray, indicating that the final decision is decoupled from the visual information.
Figure 2: Modality-Specific Reasoning Collapse (VQA-RAD). A text-only RL-trained vision-language model produces identical answers and visually detailed reasoning when evaluated on the correct image and a shuffled image. Despite explicit references to radiological features, the model's prediction remains invariant, resulting in a positive Hallucinated Visual Reasoning Rate (HVRR).
