Think Twice to See More: Iterative Visual Reasoning in Medical VLMs

Kaitao Chen, Shaohao Rui, Yankai Jiang, Jiamin Wu, Qihao Zheng, Chunfeng Song, Xiaosong Wang, Mu Zhou, Mianxin Liu

TL;DR

ViTAR tackles the gap between static image-question reasoning in medical VLMs and clinicians' iterative, region-focused workflows. It introduces a think-act-rethink-answer framework and supports it with a two-stage training regime (SFT followed by GRPO-based RL) plus curated data: 1K interaction-instruction samples and 16K VQA samples. It demonstrates state-of-the-art performance on seven medical VQA benchmarks, with notable gains in intrinsic reasoning and robust visual grounding. It also analyzes attention dynamics, showing that second-round reasoning sharpens grounding in clinically critical regions while keeping attention on visual tokens high. This work advances trustworthy, expert-style AI in healthcare by enabling dynamic visual reasoning without heavy external tool dependencies.
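The "GRPO-based RL" stage mentioned above relies on group-relative advantages: several responses are sampled for the same prompt, each is scored, and every reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization step. The reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are illustrative assumptions, since this summary does not describe the paper's reward design or hyperparameters.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one sampled group (all responses to one prompt).

    Assumption: population standard deviation plus a small eps for stability;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to the same question
# (1.0 = judged correct, 0.0 = judged incorrect).
example_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every sample receives the same reward, there is no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0])
```

Responses that score above the group average receive a positive advantage and are reinforced. Those below it are penalized, and no separate value network is needed.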

Abstract

Medical vision-language models (VLMs) excel at image-text understanding but typically rely on single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of "think-act-rethink-answer". ViTAR treats medical images as interactive objects, enabling models to engage in multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors. In addition, we curate a 16K-sample visual question answering training set geared toward fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the "think" to "rethink" rounds, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both the performance and the trustworthiness of medical AI.
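The "think-act-rethink-answer" chain described above can be pictured as a small control loop: a first pass proposes a region, an action marks that region on the image, and a second pass reasons over the marked image before answering. The following is a minimal sketch under stated assumptions, not the authors' implementation. The toy image, the `toy_think` and `toy_rethink` stand-ins for the model, and the box-outline marking are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel coordinates
Image = List[List[int]]          # toy grayscale image as rows of pixel values


@dataclass
class Trace:
    think: str
    box: Box
    rethink: str
    answer: str


def mark_region(image: Image, box: Box, value: int = 255) -> Image:
    """'Act' step: return a copy of the image with the box outline drawn."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            if y in (y0, y1 - 1) or x in (x0, x1 - 1):
                out[y][x] = value
    return out


def think_act_rethink_answer(
    image: Image,
    question: str,
    think: Callable[[Image, str], Tuple[str, Box]],
    rethink: Callable[[Image, str, str], Tuple[str, str]],
) -> Trace:
    thought, box = think(image, question)            # round 1: think, propose a region
    marked = mark_region(image, box)                 # act: highlight the region
    second, answer = rethink(marked, question, thought)  # round 2: rethink, then answer
    return Trace(thought, box, second, answer)


# Hypothetical stand-ins for the model's two reasoning rounds.
def toy_think(image: Image, question: str) -> Tuple[str, Box]:
    h, w = len(image), len(image[0])
    _, y, x = max((image[r][c], r, c) for r in range(h) for c in range(w))
    box = (max(x - 1, 0), max(y - 1, 0), min(x + 2, w), min(y + 2, h))
    return f"Brightest area near ({x}, {y}).", box


def toy_rethink(marked: Image, question: str, thought: str) -> Tuple[str, str]:
    return "Re-inspecting the marked region.", "abnormal"


toy_image = [[0] * 5 for _ in range(5)]
toy_image[2][3] = 9  # a single bright "lesion" pixel
trace = think_act_rethink_answer(toy_image, "Is there a lesion?", toy_think, toy_rethink)
```

In a real system both callables would be calls to the same VLM, with the marked image fed back as a new visual input for the second round.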

Paper Structure

This paper contains 42 sections, 1 equation, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Comparison between the vanilla VLM and ViTAR. ViTAR specializes in explicit visual grounding with iterative reasoning. ViTAR first observes the image input, initiates an action to highlight key regions, and reasons over these regions to reach the final conclusion.
  • Figure 2: ViTAR's framework of visual thinking and action-centric reasoning. In Stage I (supervised fine-tuning), ViTAR is trained with structured instructions to mimic expert-like reasoning patterns and region-marking behaviors. In Stage II, ViTAR is further optimized with rewards through reinforcement learning, shifting from imitation to autonomous decision refinement.
  • Figure 3: Compared with the first "think" stage (round 1), the second "rethink" stage (round 2) aligns more precisely with annotated lesion regions and allocates a higher proportion of attention to visual tokens. ViTAR also outperforms Lingshu by sustaining focused attention on verifiable regions and allocating more attention to visual tokens. See the attention-allocation figure in the Appendix for more details.
  • Figure 4: Performance comparison with human annotation.
  • Figure 5: Comparison of reasoning efficiency (time in seconds).
  • ...and 16 more figures