Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Yuan Wu; Zongxian Yang; Jiayu Qian; Songpan Gao; Guanxing Chen; Qiankun Li; Yu-An Huang; Zhi-An Huang

TL;DR

The findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains.

Abstract

Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT-DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at https://github.com/TianYin123/Better_Eyes_Better_Thoughts.
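As a rough illustration of the two inference-time interventions named in the abstract, the sketch below assembles a text prompt for either DirA or CoT and optionally prepends a region-of-interest cue and an image description. It is not the authors' implementation: the function name `build_prompt`, the prompt wording, and the bounding-box format `(x0, y0, x1, y1)` are all assumptions made for this example, and the released code may pass RoI cues differently (for instance, by drawing on the image).

```python
from typing import Optional, Tuple

# Hypothetical instruction suffixes; the paper's actual prompt wording is not given here.
DIRA_SUFFIX = "Answer directly with the final answer only."
COT_SUFFIX = "Think step by step, then state the final answer."


def build_prompt(
    question: str,
    mode: str = "cot",
    roi: Optional[Tuple[int, int, int, int]] = None,
    description: Optional[str] = None,
) -> str:
    """Assemble a text prompt with optional grounding cues (illustrative only)."""
    parts = []
    if roi is not None:
        # Perception anchoring: point the model at a region of interest.
        x0, y0, x1, y1 = roi
        parts.append(f"Focus on the image region from ({x0}, {y0}) to ({x1}, {y1}).")
    if description is not None:
        # Description grounding: supply textual guidance about the image.
        parts.append(f"Image description: {description}")
    parts.append(f"Question: {question}")
    if mode == "cot":
        parts.append(COT_SUFFIX)
    elif mode == "dira":
        parts.append(DIRA_SUFFIX)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "Is there a pleural effusion?",
        mode="cot",
        roi=(10, 20, 110, 220),
        description="Frontal chest radiograph.",
    ))
```

Because both interventions only change the model input, a sketch like this needs no fine-tuning, which matches the training-free framing in the abstract.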

Paper Structure (12 sections, 6 equations, 3 figures, 2 tables)

Introduction
Methodology
A Three-Stage View of Perception Bottlenecks in CoT
Bridging the Modality Gap: Spatial and Semantic Interventions
Experiments
Experimental Setup
Benchmarks and Models.
Implementation Details.
Does the success of CoT in general tasks transfer to medical VQA?
Is CoT reasoning critically bounded by visual perception?
Can bridging the perception gap reactivate the reasoning potential of medical VLMs?
Conclusion and Discussion

Figures (3)

Figure 1: The three-stage Medical VLM CoT framework and targeted interventions.
Figure 2: Main results across RQ1-RQ3. (a) CoT improves general benchmarks but degrades medical benchmarks. (b) CoT is more sensitive to progressive visual degradation than DirA. (c) Supplementing models with expert-level image descriptions alone effectively mitigates CoT degradation. (d, e) Counterfactual inputs reveal pseudo-robustness in DirA and stronger visual dependence in CoT. (f) Incorrect RoI and descriptions degrade CoT performance, confirming its reliance on accurate visual grounding.
Figure 3: Qualitative case study. Standard CoT exhibits misaligned attention patterns and incorrect conclusions, while grounded interventions provide additional spatial and semantic priors that yield more visually consistent reasoning trajectories.
