
Don't Blink: Evidence Collapse during Multimodal Reasoning

Suresh Raghu, Satwik Pandey

Abstract

Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: low-entropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks. Using this structure, a targeted vision veto reduces selective risk by up to 1.9 percentage points at 90% coverage, while avoiding degradations where disengagement is expected. The results support task-aware multimodal monitoring for safe deployment under distribution shift.
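The abstract's headline metric, selective risk at a fixed coverage, can be made concrete with a small sketch. The function below is an illustration of the standard definition (error rate on the most-confident fraction of predictions kept), not the paper's actual evaluation code; the confidence scores and labels are toy values.

```python
def selective_risk(confidences, correct, coverage=0.9):
    """Error rate among the `coverage` fraction of most-confident predictions.

    confidences: per-example confidence scores (higher = more confident)
    correct:     per-example booleans (True = prediction was right)
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    k = max(1, int(round(coverage * len(order))))  # number of predictions kept
    kept = order[:k]
    errors = sum(1 for i in kept if not correct[i])
    return errors / k

# Toy example: abstaining on the least-confident 10% drops one of two errors.
conf = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
ok   = [True, True, True, True, True, True, True, True, False, False]
print(selective_risk(conf, ok, coverage=0.9))  # 1 error among 9 kept, ~0.111
print(selective_risk(conf, ok, coverage=1.0))  # full coverage: 2/10 = 0.2
```

A "vision veto" in this framing would lower the effective confidence of low-entropy but visually disengaged predictions so they fall below the coverage cutoff.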

Paper Structure

This paper contains 38 sections, 11 equations, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: Evidence collapse creates confident errors invisible to text-only monitoring. Two HallusionBench exemplars processed by GLM-4.6V-Flash on nearly identical map-comparison questions. Top (Failure): On an adversarially modified map, the model attempts visual grounding early but textual priors override image evidence mid-reasoning (red spans), producing a confident wrong answer with no attention recovery. Bottom (Success): On the unmodified map, textual priors align with the image and visual engagement is sustained, peaking at mid-generation. Entropy-derived confidence differs by only 3 percentage points (87.2% vs. 90.2%), yet the visual attention trajectories, summarized by $V_{\mathrm{thinking\_auc}}$, are qualitatively different. Text-only confidence cannot separate these cases; cumulative visual attention can.
  • Figure 2: Entropy discrimination across nine model$\times$dataset cells. Panel (a) shows answer-time entropy ($\bar{H}_{\text{ans}}$) and panel (b) shows full-generation entropy ($\bar{H}_{\text{full}}$). Values are Cohen's $d$ (correct vs. incorrect). In panel (a), discrimination significant at conventional thresholds appears only in GLM $\times$ HallusionBench. In panel (b), all nine cells are negative (incorrect answers have higher entropy), yielding a more consistent text-only baseline.
  • Figure 3: Architecture-dependent grounding layers. Layerwise AUROC for evidence localization on a held-out calibration set. Qwen models (DeepStack) peak in early layers (0--5, AUROC $\approx 0.75$--$0.77$). GLM (single-stream early-fusion) peaks in layers 15--20. Grounding layer selection must be architecture-specific.
  • Figure 4: Three trajectory patterns of visual grounding across seven generation positions. Correct (solid) vs. incorrect (dashed) with 95% CI bands. Sustained deficit (HallusionBench): correct maintains higher $V$ throughout. Crossover (MathVista): incorrect starts higher, then crosses below during reasoning. Moderate (MMMU_Pro): smaller, less consistent separations. Exact effect sizes and significance tests are reported in Appendix \ref{app:trajectory_stats}.
  • Figure 5: Task-conditional interaction coefficients ($\beta_{EV}$). Forest plot with bootstrap 95% CIs. Negative $\beta_{EV}$ = confident-but-blind penalty. MMMU_Pro shows significant penalties for GLM ($-0.41$) and Qwen-8B ($-0.61$); MathVista is near zero or positive. This sign flip explains why global linear fusion fails.
  • ...and 1 more figure
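Figures 2 and 5 report Cohen's $d$ for entropy between correct and incorrect answers. As a reference for how such per-cell effect sizes are computed, here is a minimal sketch using the pooled-standard-deviation form of Cohen's $d$; the entropy values below are invented toy data, not results from the paper.

```python
import math

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Toy per-example entropies: incorrect answers have higher mean entropy,
# so d(correct, incorrect) comes out negative, matching the sign
# convention described for Figure 2.
correct_entropy = [0.8, 0.9, 1.0, 1.1, 1.2]
incorrect_entropy = [1.4, 1.5, 1.6, 1.7, 1.8]
print(cohens_d(correct_entropy, incorrect_entropy))  # negative value
```

In the paper's setup this would be computed once per model$\times$dataset cell, giving the nine values plotted in Figure 2.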