EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease
Qiuhui Chen, Xuancheng Yao, Zhenglei Zhou, Xinyue Hu, Yi Hong
TL;DR
The paper presents EMAD, a vision–language framework for Alzheimer's disease diagnosis that grounds each claim in its generated report in clinical evidence and localized brain anatomy. It introduces Sentence–Evidence–Anatomy (SEA) Grounding to link sentences to clinical evidence and 3D MRI masks, GTX-Distill for label-efficient grounding transfer, and Executable-Rule GRPO with verifiable rewards to enforce clinical consistency and adherence to NIA-AA criteria. Evaluated on the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic performance while producing more transparent, anatomically faithful reports. The authors plan to release code and grounding annotations to support future research on evidence-grounded clinical reporting.
Abstract
Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This opacity is especially problematic in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes the corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.
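
As a rough illustration of what "verifiable rewards" might look like in a GRPO-style setup, the sketch below scores each sampled report with a few executable checks and then normalizes the scores within the sampled group (reward minus group mean, divided by group standard deviation). This is not the authors' implementation. The `Report` fields, the three rule functions, the label set {"CN", "MCI", "AD"}, and the record keys are all hypothetical and chosen only to keep the example self-contained.

```python
import re
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Dict, List


@dataclass
class Report:
    """Hypothetical container for one generated diagnostic report."""
    diagnosis: str             # assumed label, e.g. "CN", "MCI", or "AD"
    reasoning: str             # free-text rationale produced by the model
    cited_evidence: List[str]  # names of clinical fields the report cites


# A rule is an executable check returning True when the report passes.
Rule = Callable[[Report, Dict[str, float]], bool]


def diagnosis_is_valid_label(report: Report, record: Dict[str, float]) -> bool:
    # Toy protocol-adherence check: the label must come from a fixed set.
    return report.diagnosis in {"CN", "MCI", "AD"}


def cites_available_evidence(report: Report, record: Dict[str, float]) -> bool:
    # Toy clinical-consistency check: every cited field exists in the record.
    return bool(report.cited_evidence) and all(
        name in record for name in report.cited_evidence
    )


def reasoning_mentions_diagnosis(report: Report, record: Dict[str, float]) -> bool:
    # Toy reasoning-diagnosis coherence check: the rationale names the label.
    pattern = rf"\b{re.escape(report.diagnosis)}\b"
    return re.search(pattern, report.reasoning) is not None


RULES: List[Rule] = [
    diagnosis_is_valid_label,
    cites_available_evidence,
    reasoning_mentions_diagnosis,
]


def rule_reward(report: Report, record: Dict[str, float]) -> float:
    """Fraction of rules the report passes, in [0, 1]."""
    return sum(rule(report, record) for rule in RULES) / len(RULES)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, several reports would be sampled for the same patient record, each scored with `rule_reward`, and the resulting list passed to `group_relative_advantages`; reports that pass more rules than their siblings receive positive advantages. Because every rule is a plain function over the report and the record, each reward can be re-checked independently of the model.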
