Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang

TL;DR

The findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, not just longer text-driven reasoning chains.

Abstract

Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT-DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at https://github.com/TianYin123/Better_Eyes_Better_Thoughts.
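
To make the two inference-time interventions in the abstract concrete, the following is a minimal sketch of how DirA and CoT prompts might be assembled with optional grounding cues. It is not the authors' implementation: the `VQAItem` type, the `build_prompt` function, the prompt wording, and the choice to express the region of interest as textual pixel coordinates are all assumptions made for illustration (the paper may instead mark the region on the image itself).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VQAItem:
    """One medical VQA example (hypothetical container, not from the paper)."""
    question: str
    # Assumed RoI format: (x0, y0, x1, y1) pixel box.
    roi: Optional[Tuple[int, int, int, int]] = None
    # Assumed high-quality textual description of the image.
    description: Optional[str] = None


# Assumed prompt suffixes; the actual wording used in the paper is not known here.
DIRA_SUFFIX = "Answer with the final answer only."
COT_SUFFIX = "Think step by step, then give the final answer."


def build_prompt(item: VQAItem, mode: str = "dira",
                 use_roi: bool = False, use_description: bool = False) -> str:
    """Compose the text prompt for direct answering or chain-of-thought.

    use_roi         -> perception anchoring (region-of-interest cue)
    use_description -> description grounding (textual guidance)
    """
    if mode not in ("dira", "cot"):
        raise ValueError(f"unknown mode: {mode!r}")

    parts = []
    if use_roi and item.roi is not None:
        x0, y0, x1, y1 = item.roi
        parts.append(
            f"Focus on the image region with corners ({x0}, {y0}) and ({x1}, {y1})."
        )
    if use_description and item.description:
        parts.append(f"Image description: {item.description}")
    parts.append(f"Question: {item.question}")
    parts.append(COT_SUFFIX if mode == "cot" else DIRA_SUFFIX)
    return "\n".join(parts)
```

Under these assumptions, comparing `build_prompt(item, mode="cot")` against `build_prompt(item, mode="cot", use_roi=True, use_description=True)` on the same items would isolate the effect of grounding on CoT, since both interventions change only the inference-time input and require no training.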

Paper Structure (12 sections, 6 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The three-stage Medical VLM CoT framework and targeted interventions.
  • Figure 2: Main results across RQ1-RQ3. (a) CoT improves general benchmarks but degrades medical benchmarks. (b) CoT is more sensitive to progressive visual degradation than DirA. (c) Supplementing models with expert-level image descriptions alone effectively mitigates CoT degradation. (d,e) Counterfactual inputs reveal pseudo-robustness in DirA and stronger visual dependence in CoT. (f) Incorrect RoI and descriptions degrade CoT performance, confirming its reliance on accurate visual grounding.
  • Figure 3: Qualitative case study. Standard CoT exhibits misaligned attention patterns and incorrect conclusions, while grounded interventions provide additional spatial and semantic priors that yield more visually consistent reasoning trajectories.