EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

Qiuhui Chen, Xuancheng Yao, Zhenglei Zhou, Xinyue Hu, Yi Hong

TL;DR

The paper presents EMAD, a vision–language framework for Alzheimer's disease diagnosis that grounds each narrative claim in clinical evidence and localized brain anatomy. It introduces Sentence–Evidence–Anatomy (SEA) Grounding to link sentences to clinical data and 3D MRI masks, GTX-Distill for label-efficient grounding transfer, and Executable-Rule GRPO with verifiable rewards to enforce clinical consistency and adherence to NIA-AA criteria. Evaluated on the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic performance while producing more transparent, anatomically faithful reports. The authors plan to release code and grounding annotations to support future research in evidence-grounded clinical reporting.

Abstract

Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.
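
To make the idea of a verifiable, rule-based reward more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each generated report is a dictionary with hypothetical fields `diagnosis`, `reasoning`, and `cited_evidence`, that the three rule functions stand in for the paper's executable rules (which the abstract does not specify), and that rewards are normalized within a sampled group using the mean and standard deviation, as in the standard GRPO formulation.

```python
from statistics import mean, pstdev

# All rule functions and report field names below are hypothetical
# illustrations; the abstract does not list the actual rules.

def rule_valid_label(report):
    # Protocol adherence: the diagnosis must be one of the allowed labels.
    return report.get("diagnosis") in {"CN", "MCI", "AD"}

def rule_reasoning_mentions_label(report):
    # Reasoning-diagnosis coherence: the label must appear in the reasoning text.
    label = report.get("diagnosis")
    return bool(label) and label in report.get("reasoning", "")

def rule_cites_evidence(report):
    # Clinical consistency proxy: at least one evidence field must be cited.
    return len(report.get("cited_evidence", [])) > 0

RULES = [rule_valid_label, rule_reasoning_mentions_label, rule_cites_evidence]

def rule_reward(report, rules=RULES):
    # Fraction of executable rules that the report satisfies, in [0, 1].
    return sum(1.0 for rule in rules if rule(report)) / len(rules)

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward against the mean and std of its sampled group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of three sampled reports for one (hypothetical) patient.
group = [
    {"diagnosis": "AD",
     "reasoning": "Hippocampal atrophy and low MMSE support AD.",
     "cited_evidence": ["MMSE"]},
    {"diagnosis": "MCI",
     "reasoning": "Findings are unremarkable.",
     "cited_evidence": []},
    {"diagnosis": "unknown",
     "reasoning": "unknown",
     "cited_evidence": ["APOE"]},
]
rewards = [rule_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Because every rule is a plain function over the report, the reward can be recomputed and audited independently of the model, which is the sense in which such a reward is verifiable. In this sketch, reports that satisfy more rules than the group average receive positive advantages.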

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Evidence-centric multimodal clinical diagnosis.
  • Figure 2: (a) Overall framework from multimodal inputs (3D image, clinical text) to encoders, linear projections, multimodal fusion, and textual decoding to a report. (b) SEA Grounding produces evidence and anatomy distributions via two-step hierarchical alignment.
  • Figure 3: Grounding Transfer Distillation (GTX-Distill).
  • Figure 4: Pipeline for reasoning and grounding generation.
  • Figure 5: Inference example of EMAD. The model integrates sMRI and clinical evidence to produce a grounded reasoning chain and diagnosis, with sentence-level claims linked to clinical fields and localized 3D anatomy.
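
The two-step hierarchical alignment described in the Figure 2(b) caption can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's architecture: sentences, evidence phrases, and anatomical regions are represented by hypothetical embedding vectors, similarity is a plain dot product, and the anatomy distribution is obtained by marginalizing over evidence. The function name `sea_grounding` and all inputs are invented for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sea_grounding(sentence_vec, evidence_vecs, region_vecs):
    # Step 1 (sentence -> evidence): distribution over evidence phrases.
    p_evidence = softmax([dot(sentence_vec, e) for e in evidence_vecs])
    # Step 2 (evidence -> anatomy): per-evidence distribution over regions,
    # combined by weighting with p_evidence (an assumed marginalization).
    p_region = [0.0] * len(region_vecs)
    for weight, e in zip(p_evidence, evidence_vecs):
        p_region_given_e = softmax([dot(e, r) for r in region_vecs])
        for i, p in enumerate(p_region_given_e):
            p_region[i] += weight * p
    return p_evidence, p_region

# Toy 2-D embeddings (hypothetical values).
sentence = [1.0, 0.0]
evidence = [[2.0, 0.0], [0.0, 2.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
p_evidence, p_region = sea_grounding(sentence, evidence, regions)
```

Both outputs are probability distributions, so the most likely evidence phrase and anatomical region for a sentence can be read off with an argmax. In the actual system the anatomy step localizes structures on 3D MRI rather than scoring a short list of region vectors.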