
ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models

Xiwei Liu, Yulong Li, Xinlin Zhuang, Xuhui Li, Jianxu Chen, Haolin Yang, Imran Razzak, Yutong Xie

TL;DR

ClinCoT is a clinical-aware visual chain-of-thought framework that shifts preference optimization from response-level correction to visual-driven reasoning. On three medical VQA and report generation benchmarks, it outperforms existing preference-based alignment methods.

Abstract

Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypothesis-driven region proposals. Multiple Med-LLM evaluators rank and score each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based, margin-aware optimization strategy that incorporates both the preference ranking and the score difference to refine region-level reasoning trajectories. To maintain alignment as the model's policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
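
To make the idea of a scoring-based, margin-aware objective concrete, the following is a minimal sketch of one common way to combine a preference ranking with a score difference: a DPO-style logistic loss with an added margin proportional to the evaluator score gap. This is an illustrative assumption, not the paper's actual objective, which is not reproduced here. The function name, the parameters `beta` and `gamma`, and the linear form of the margin are all hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def margin_aware_preference_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    score_chosen: float,
    score_rejected: float,
    beta: float = 0.1,   # hypothetical temperature on the implicit reward
    gamma: float = 1.0,  # hypothetical weight on the score-difference margin
) -> float:
    """DPO-style loss with a margin set by the evaluator score gap (assumed form)."""
    # Implicit reward gap between the preferred and dispreferred responses,
    # measured relative to a frozen reference model as in standard DPO.
    reward_gap = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # The larger the evaluators' score difference, the larger the reward gap
    # the policy must reach before the loss becomes small.
    margin = gamma * (score_chosen - score_rejected)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-(reward_gap - margin))
```

With `gamma = 0` the sketch reduces to the standard DPO loss for a single pair, so the margin term is the only part that uses the score magnitudes rather than just the ranking.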

Paper Structure (9 sections, 8 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: Overview of the ClinCoT workflow. Given a subset of input pairs $\mathcal{X}_i \subset \{\mathcal{X}_1,\ldots,\mathcal{X}_m\}$, ClinCoT first employs a clinical-aware tool with a predefined hypothesis set $\mathcal{P}$ to generate region proposals $\{r_i\}_{i=1}^n$. The target model $f_{\mathsf{tar}}^i$ produces region-conditioned reasoning chains $\{y_t^i = \mathrm{CoT}_t^i\}_{i=1}^n$ based on the preserved preferred history $y_{0:t-1}=\{\mathrm{CoT}_0^6,\ldots,\mathrm{CoT}_{t-1}^1\}$, integrating both the original image and the candidate regions. Med-LLM evaluators assign scores to construct preference pairs $\mathcal{D}_i$, distinguishing preferred from dispreferred responses. A consensus-weighted, scoring-based optimization updates $f_{\mathsf{tar}}^i$ to $f_{\mathsf{tar}}^{i+1}$ through iterative training. The updated model $f_{\mathsf{tar}}^{i+1}$ is then applied to $\mathcal{X}_{i+1}$ to generate new preference pairs $\mathcal{D}_{i+1}$ for the next iteration.
  • Figure 2: Visualization of generated preference data.
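
The preference-pair construction step described in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: it assumes that the scores from several evaluators are combined by a plain mean and that the highest- and lowest-scoring candidates form the pair. The paper's consensus weighting may differ, and the function name and the dictionary keys are hypothetical.

```python
from statistics import mean


def build_preference_pair(candidates, evaluator_scores):
    """Pick a preferred and a dispreferred candidate from evaluator scores.

    candidates: list of candidate reasoning steps (one per region proposal).
    evaluator_scores: one list of scores per evaluator, aligned with candidates.
    """
    # Assumed consensus rule: unweighted mean across evaluators per candidate.
    consensus = [mean(per_candidate) for per_candidate in zip(*evaluator_scores)]
    indices = range(len(candidates))
    best = max(indices, key=consensus.__getitem__)
    worst = min(indices, key=consensus.__getitem__)
    return {
        "chosen": candidates[best],
        "rejected": candidates[worst],
        # The score gap can feed a margin-aware objective downstream.
        "score_margin": consensus[best] - consensus[worst],
    }


pair = build_preference_pair(
    ["CoT on region 1", "CoT on region 2", "CoT on region 3"],
    [[0.2, 0.9, 0.5], [0.4, 0.7, 0.5]],  # two hypothetical evaluators
)
```

In an iterative scheme like the one in the caption, a function of this kind would be called again after each model update, so that the pairs reflect the current policy's outputs rather than those of the initial model.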