
Don't Blink: Evidence Collapse during Multimodal Reasoning

Suresh Raghu, Satwik Pandey

Abstract

Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: low-entropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks. Using this structure, a targeted vision veto reduces selective risk by up to 1.9 percentage points at 90% coverage, while avoiding degradations where disengagement is expected. The results support task-aware multimodal monitoring for safe deployment under distribution shift.
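The abstract's headline metric, selective risk at a fixed coverage, can be made concrete with a small sketch. The function below is an illustration of the standard definition (error rate on the most-confident fraction of predictions kept), not the paper's actual evaluation code; the confidence scores and labels are toy values.

```python
def selective_risk(confidences, correct, coverage=0.9):
    """Error rate among the `coverage` fraction of most-confident predictions.

    confidences: per-example confidence scores (higher = more confident)
    correct:     per-example booleans (True = prediction was right)
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    k = max(1, int(round(coverage * len(order))))  # number of predictions kept
    kept = order[:k]
    errors = sum(1 for i in kept if not correct[i])
    return errors / k

# Toy example: abstaining on the least-confident 10% drops one of two errors.
conf = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
ok   = [True, True, True, True, True, True, True, True, False, False]
print(selective_risk(conf, ok, coverage=0.9))  # 1 error among 9 kept, ~0.111
print(selective_risk(conf, ok, coverage=1.0))  # full coverage: 2/10 = 0.2
```

A "vision veto" in this framing would lower the effective confidence of low-entropy but visually disengaged predictions so they fall below the coverage cutoff.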

Paper Structure

This paper contains 38 sections, 11 equations, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: Evidence collapse creates confident errors invisible to text-only monitoring. Two HallusionBench exemplars processed by GLM-4.6V-Flash on nearly identical map-comparison questions. Top (Failure): On an adversarially modified map, the model attempts visual grounding early but textual priors override image evidence mid-reasoning (red spans), producing a confident wrong answer with no attention recovery. Bottom (Success): On the unmodified map, textual priors align with the image and visual engagement is sustained, peaking at mid-generation. Entropy-derived confidence differs by only 3 percentage points (87.2% vs. 90.2%), yet the visual attention trajectories, summarized by $V_{\mathrm{thinking\_auc}}$, are qualitatively different. Text-only confidence cannot separate these cases; cumulative visual attention can.
  • Figure 2: Entropy discrimination across nine model$\times$dataset cells. Panel (a) shows answer-time entropy ($\bar{H}_{\text{ans}}$) and panel (b) shows full-generation entropy ($\bar{H}_{\text{full}}$). Values are Cohen's $d$ (correct vs. incorrect). In panel (a), discrimination significant at conventional thresholds appears only in GLM $\times$ HallusionBench. In panel (b), all nine cells are negative (incorrect answers have higher entropy), yielding a more consistent text-only baseline.
  • Figure 3: Architecture-dependent grounding layers. Layerwise AUROC for evidence localization on a held-out calibration set. Qwen models (DeepStack) peak in early layers (0--5, AUROC $\approx 0.75$--$0.77$). GLM (single-stream early-fusion) peaks in layers 15--20. Grounding layer selection must be architecture-specific.
  • Figure 4: Three trajectory patterns of visual grounding across seven generation positions. Correct (solid) vs. incorrect (dashed) with 95% CI bands. Sustained deficit (HallusionBench): correct maintains higher $V$ throughout. Crossover (MathVista): incorrect starts higher, then crosses below during reasoning. Moderate (MMMU_Pro): smaller, less consistent separations. Exact effect sizes and significance tests are reported in Appendix \ref{app:trajectory_stats}.
  • Figure 5: Task-conditional interaction coefficients ($\beta_{EV}$). Forest plot with bootstrap 95% CIs. Negative $\beta_{EV}$ = confident-but-blind penalty. MMMU_Pro shows significant penalties for GLM ($-0.41$) and Qwen-8B ($-0.61$); MathVista is near zero or positive. This sign flip explains why global linear fusion fails.
  • ...and 1 more figure
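Figures 2 and 5 report Cohen's $d$ for entropy between correct and incorrect answers. As a reference for how such per-cell effect sizes are computed, here is a minimal sketch using the pooled-standard-deviation form of Cohen's $d$; the entropy values below are invented toy data, not results from the paper.

```python
import math

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Toy per-example entropies: incorrect answers have higher mean entropy,
# so d(correct, incorrect) comes out negative, matching the sign
# convention described for Figure 2.
correct_entropy = [0.8, 0.9, 1.0, 1.1, 1.2]
incorrect_entropy = [1.4, 1.5, 1.6, 1.7, 1.8]
print(cohens_d(correct_entropy, incorrect_entropy))  # negative value
```

In the paper's setup this would be computed once per model$\times$dataset cell, giving the nine values plotted in Figure 2.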