MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models

Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai, Yuan Xue

Abstract

Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, how to construct high-quality causal-spurious contrastive samples, and how to maintain causal consistency across reasoning trajectories. To address these challenges, we propose MedCausalX, an end-to-end framework that explicitly models causal reasoning chains in medical VLMs. We first introduce the CRMed dataset, which provides fine-grained anatomical annotations, structured causal reasoning chains, and counterfactual variants that guide the learning of causal relationships beyond superficial correlations. Building upon CRMed, MedCausalX employs a two-stage adaptive reflection architecture equipped with $\langle$causal$\rangle$ and $\langle$verify$\rangle$ tokens, enabling the model to autonomously determine when and how to perform causal analysis and verification. Finally, a trajectory-level causal correction objective, optimized through error-attributed reinforcement learning, refines the reasoning chain, allowing the model to distinguish genuine causal dependencies from shortcut associations. Extensive experiments on multiple benchmarks show that MedCausalX consistently outperforms state-of-the-art methods, improving diagnostic consistency by 5.4 points, reducing hallucination by over 10 points, and attaining the highest spatial grounding IoU, thereby setting a new standard for causally grounded medical reasoning.
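The abstract reports spatial grounding quality as IoU (intersection over union). As a reminder of what that metric measures, the following is a minimal sketch of IoU for two axis-aligned bounding boxes. The `(x_min, y_min, x_max, y_max)` box layout and the name `box_iou` are assumptions made for illustration; the paper's own evaluation code is not shown here.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return the intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so disjoint boxes yield an empty intersection
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs are treated as having no overlap
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    ground_truth = (1.0, 1.0, 3.0, 3.0)
    print(box_iou(predicted, ground_truth))
```

A predicted box that matches the ground truth exactly scores 1.0, and a box with no overlap scores 0.0, so higher IoU indicates tighter localization.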

Paper Structure (33 sections, 11 equations, 12 figures, 11 tables)

Figures (12)

  • Figure 1: Existing medical VLMs often lack causality-awareness, which manifests as (1) spurious CoT reasoning that attends to mislocalized anatomy or (2) overlooked causal evidence for pathology diagnosis. MedCausalX achieves trustworthy diagnosis through adaptive causal verification and correction during inference.
  • Figure 2: Overview of the MedCausalX framework. We first construct the CRMed dataset via fine-grained annotations, then factorize reasoning via a structural causal model (SCM) with reflective tokens (<causal>, <verify>) that enable two-stage adaptive reasoning. After causal Supervised Fine-Tuning (SFT), MedCausalX is optimized via a dual-policy scheme that couples off-policy preference alignment (DPO) with on-policy causal reinforcement (GRPO), bridging stability and adaptability to enhance the model's causal reasoning capability.
  • Figure 3: Comparison of causal reasoning in medical VQA and region-grounded diagnosis. (a) MedRegA and Med-R1 exhibit spurious correlations and mistakes, while MedCausalX captures correct predictions via structured causal decomposition. (b) Two-stage adaptive causal reasoning and verification, producing refined bounding boxes and logically consistent diagnostic descriptions. Attention maps are used to visualize the model's focus. GT: ground truth; A1--A3: predictions from MedCausalX, Med-R1, and MedRegA, respectively.
  • Figure 4: Radiologists' evaluation and comparative results. Left: 3-point clinical criteria for spatial localization (Q1) and diagnostic description (Q2). Right: radar chart comparing MedRegA, Med-R1, and MedCausalX on localization (blue) and description (orange); lower scores indicate better performance.
  • Figure 5: Modality distribution and factual-counterfactual composition of the CRMed dataset. The dataset spans multiple imaging modalities, with counterfactual variants constructed through controlled interventions, enabling robust evaluation of causal reasoning.
  • ...and 7 more figures
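
The Figure 2 caption names two optimization components, DPO and GRPO, without giving their formulas. The sketch below shows the commonly used textbook forms of each: the DPO loss for one preference pair, and the group-relative advantage that GRPO computes over a group of sampled responses. It is not the paper's trajectory-level causal correction objective. The function names, the default `beta=0.1`, the `eps` term, and the choice of population standard deviation are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev
from typing import List


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature; not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Log-ratio of policy to reference for each response
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return _softplus(-margin)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mu = mean(rewards)
    # Population std is an assumption; implementations differ on this detail
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical log-probabilities and rewards, for illustration only
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

In this form, DPO learns from fixed preference pairs scored against a frozen reference model (off-policy), while the group-relative advantage is computed from rewards on responses sampled from the current policy (on-policy), which matches the off-policy and on-policy roles the caption assigns to the two components.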