Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

Huan Zheng, Yucheng Zhou, Tianyi Yan, Dubing Chen, Hongbo Lu, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.

Paper Structure (22 sections, 2 theorems, 14 equations, 6 figures, 2 tables)

Key Result

theorem 1

Let $K(Z_e) < K(Z_c)$. Under gradient descent optimization of the SFT loss $\mathcal{L}$, the model parameters $\mathbf{w} = [\mathbf{w}_c; \mathbf{w}_e]$ satisfy $\| \mathbf{w}_e \| > \| \mathbf{w}_c \|$, leading to $\mathcal{S}_e > \mathcal{S}_c$.

Figures (6)

  • Figure 1: Illustration of the motivation. (a) Existing methods suffer from clinical cognition misalignment. (b) Our CogAlign framework enforces a strict clinical cognitive flow. (c) A representative failure case generated by Gemini 3 Pro. (d) A radar chart highlighting the superior accuracy of CogAlign across diverse benchmarks.
  • Figure 2: Overview of the dataset curation pipeline. (a) shows the collection and filtering of diverse endoscopic images. (b) shows the generation of hierarchical clinical cognition reasoning chains. (c) shows the human expert refinement process to eliminate hallucinations. (d) shows a generated sample example.
  • Figure 3: Overview of the proposed CogAlign framework. The pipeline consists of two fundamental stages. Left panel demonstrates the clinical cognitive reasoning alignment phase, where the multimodal large language model undergoes supervised fine tuning. Right panel details the reinforcement learning phase guided by counterfactuals.
  • Figure 4: Case study between CogAlign and baseline models. The top row demonstrates CogAlign's ability to detect a subtle polyp via hierarchical clinical cognition, whereas the general model (Qwen3-VL-Plus) fails. The bottom row highlights CogAlign's robustness to visual noise in identifying erosion, where the Base-SFT model hallucinates a normal diagnosis due to a lack of causal grounding.
  • Figure 5: Detailed analysis of model robustness and counterfactual masking strategies. (a) Performance degradation under spot interference. CogAlign demonstrates superior robustness against visual perturbation, exhibiting a significantly lower accuracy drop than SFT baselines. (b) Comparison of masking techniques. Employing Gaussian blur to erase lesion features yields better diagnostic accuracy than solid white masking, validating its effectiveness for causal rectification.
  • ...and 1 more figure
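
The Figure 5(b) caption contrasts two ways of erasing lesion features to build a counterfactual "normal" sample: Gaussian blur and solid white masking. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the lesion is given as a binary mask, and that "blur" means replacing the masked pixels with a blurred copy of the image. The names `gaussian_blur` and `make_counterfactual` and the `sigma` default are hypothetical, chosen for illustration only.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur over the first two axes (H, W) with edge padding."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    out = image.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        # "valid" convolution on the padded signal restores the original length.
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out


def make_counterfactual(
    image: np.ndarray,
    lesion_mask: np.ndarray,
    sigma: float = 5.0,   # assumed value, not taken from the paper
    mode: str = "blur",
) -> np.ndarray:
    """Erase the lesion region; pixels outside the mask are left untouched.

    image:       H x W or H x W x C array with values in [0, 255]
    lesion_mask: H x W array, nonzero where the lesion is
    mode:        "blur" (Gaussian blur) or "white" (solid white fill)
    """
    img = image.astype(np.float64)
    mask = lesion_mask.astype(bool)
    if mode == "blur":
        fill = gaussian_blur(img, sigma)
    elif mode == "white":
        fill = np.full_like(img, 255.0)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    out = img.copy()
    out[mask] = fill[mask]
    return out
```

Under these assumptions, a call such as `make_counterfactual(image, lesion_mask, mode="blur")` would produce the masked counterpart of a lesion image. How such pairs enter the reward computation is described in the paper itself and is not reproduced here.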

Theorems & Definitions (6)

  • definition 1: Latent Factor Model
  • definition 2: Effective Feature Sensitivity
  • theorem 1: Shortcut Convergence in SFT
  • proof
  • theorem 2: Causal Rectification via Counterfactual Penalty
  • proof