Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

Huan Zheng; Yucheng Zhou; Tianyi Yan; Dubing Chen; Hongbo Lu; Wenlong Liao; Tao He; Pai Peng; Jianbing Shen

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
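
The counterfactual construction described in the abstract (masking the lesion to obtain a "normal" sample) can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows and a boolean lesion mask of the same shape; the function names, the kernel radius, and the sigma value are hypothetical choices for illustration and are not taken from the paper. Gaussian blur is used here because the Figure 5(b) caption reports it working better than solid white masking.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return a normalized 1-D Gaussian kernel of length 2*radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    """Swap rows and columns so the row blur can be reused for columns."""
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, radius=2, sigma=1.5):
    """Separable Gaussian blur: blur rows, then blur columns."""
    kernel = gaussian_kernel(radius, sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))

def make_counterfactual(image, lesion_mask, radius=2, sigma=1.5):
    """Assumed procedure: blur only the pixels inside lesion_mask,
    leaving the background untouched."""
    blurred = gaussian_blur(image, radius, sigma)
    return [
        [blurred[y][x] if lesion_mask[y][x] else image[y][x]
         for x in range(len(image[0]))]
        for y in range(len(image))
    ]
```

Calling `make_counterfactual(image, lesion_mask)` returns a new image in which only the masked region has been smoothed, so any diagnostic evidence outside the lesion is preserved exactly.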

Paper Structure (22 sections, 2 theorems, 14 equations, 6 figures, 2 tables)

Introduction
Related Work
Medical Multimodal Large Language Models
Gastrointestinal Disease Diagnosis
Hierarchical Clinical Cognition Dataset
Clinical Cognitive Hierarchy Definition
Human-in-the-Loop Curation Pipeline
Dataset Overview
Methodology
Problem Definition
Clinical-Cognitive Reasoning Alignment
Theoretical Analysis: Visual-Cognitive Misalignment and Causal Rectification
Counterfactual-Driven GRPO for Causal Alignment
Experiments
Experiment Setup
...and 7 more sections

Key Result

theorem 1

Let $K(Z_e) < K(Z_c)$. Under gradient descent optimization of the SFT loss $\mathcal{L}$, the model parameters $\mathbf{w} = [\mathbf{w}_c; \mathbf{w}_e]$ satisfy $\| \mathbf{w}_e \| > \| \mathbf{w}_c \|$, leading to $\mathcal{S}_e > \mathcal{S}_c$.

Figures (6)

Figure 1: Illustration of the motivation. (a) Existing methods suffer from clinical cognition misalignment. (b) Our CogAlign framework enforces a strict clinical cognitive flow. (c) A representative failure case generated by Gemini 3 Pro. (d) A radar chart highlighting the superior accuracy of CogAlign across diverse benchmarks.
Figure 2: Overview of the dataset curation pipeline. (a) shows the collection and filtering of diverse endoscopic images. (b) shows the generation of hierarchical clinical cognition reasoning chains. (c) shows the human expert refinement process to eliminate hallucinations. (d) shows a generated sample example.
Figure 3: Overview of the proposed CogAlign framework. The pipeline consists of two fundamental stages. Left panel demonstrates the clinical cognitive reasoning alignment phase, where the multimodal large language model undergoes supervised fine tuning. Right panel details the reinforcement learning phase guided by counterfactuals.
Figure 4: Case study between CogAlign and baseline models. The top row demonstrates CogAlign's ability to detect a subtle polyp via hierarchical clinical cognition, whereas the general model (Qwen3-VL-Plus) fails. The bottom row highlights CogAlign's robustness to visual noise in identifying erosion, where the Base-SFT model hallucinates a normal diagnosis due to a lack of causal grounding.
Figure 5: Detailed analysis of model robustness and counterfactual masking strategies. (a) Performance degradation under spot interference. CogAlign demonstrates superior robustness against visual perturbation, exhibiting a significantly lower accuracy drop than SFT baselines. (b) Comparison of masking techniques. Employing Gaussian blur to erase lesion features yields better diagnostic accuracy than solid white masking, validating its effectiveness for causal rectification.
...and 1 more figure

Theorems & Definitions (6)

definition 1: Latent Factor Model
definition 2: Effective Feature Sensitivity
theorem 1: Shortcut Convergence in SFT
proof
theorem 2: Causal Rectification via Counterfactual Penalty
proof
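
As a rough illustration of how a counterfactual penalty could enter a GRPO-style update, the sketch below scores each sampled response and then normalizes rewards within the group. The reward function is a hypothetical stand-in (the paper's clinical-cognition-centric rewards are not specified in this summary): it grants credit for a correct diagnosis on the original image and subtracts a penalty when the lesion-masked counterfactual is not diagnosed as "normal". The group normalization follows the usual GRPO form, advantage = (reward - group mean) / (group std + eps).

```python
import statistics

def counterfactual_reward(pred_original, pred_counterfactual, label, penalty=1.0):
    """Hypothetical reward: +1 for a correct diagnosis on the original image,
    minus `penalty` if the masked counterfactual is not called "normal"."""
    reward = 1.0 if pred_original == label else 0.0
    if pred_counterfactual != "normal":
        reward -= penalty
    return reward

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example group of sampled responses for one image with label "polyp".
# Each tuple is (prediction on original, prediction on counterfactual).
samples = [("polyp", "normal"), ("polyp", "polyp"), ("normal", "normal")]
rewards = [counterfactual_reward(o, c, "polyp") for o, c in samples]
advantages = group_relative_advantages(rewards)
```

Under this assumed reward, a response that keeps predicting the disease after the lesion is masked receives no net credit even when its original-image diagnosis is correct, which is the behaviour the counterfactual penalty is meant to discourage.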
