AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study

Wenlong Hou, Sheng Bi, Guangqian Yang, Lihao Liu, Ye Du, Hanxiao Xue, Juncheng Wang, Yuxiang Feng, Yue Xun, Nanxi Yu, Ning Mao, Mo Yang, Yi Wah Eva Cheung, Ling Long, Kay Chen Tan, Lequan Yu, Xiaomeng Ma, Shaozhen Yan, Shujun Wang

Abstract

Alzheimer's disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remained robust (80.4%-98.8%), and the agent consistently outperformed all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute accuracy gains across eight backbone LLMs and narrowed the performance differences among them. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.

Paper Structure

This paper contains 31 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: AD-CARE Agent framework and overall strategy. (a), AD-CARE for AD diagnosis was developed using multi-modal data, including individual-level demographics, imaging, neurological tests, genetic information, functional evaluation, and biospecimen results. Multi-modal data are processed by AD-CARE through three components (reasoning engine, outcome aggregator, and specialized executors), generating multi-modal outputs (diagnosis result, confidence, diagnosis report, and visualization results). (b), Agent workflow: Given a user query, the framework performs reasoning in four stages: (i) observation, (ii) thought, (iii) action, and (iv) aggregation. (c), Validation on six diverse populations (n=10,303), including four public datasets and two in-house cohorts: We first evaluated AD-CARE against baseline methods using four metrics. We then assessed fairness with respect to race and age. Next, we conducted a reader study with agent augmentation. Finally, we benchmarked AD-CARE using eight representative LLM backbones.
  • Figure 2: Performance comparison of AD-CARE and baseline models across six benchmark cohorts. AD-CARE achieves superior accuracy and F1 scores on most benchmarks, with notable gains on heterogeneous cohorts (ADNI, OASIS, NACC) and robust performance on in-house cohorts (SYSUH, XWH), underscoring its potential for real-world application.
  • Figure 3: Fairness analysis of AD-CARE and baseline methods. (a), Racial subgroups (Asian, Black, White). (b), Age subgroups (<65, 65–74, 75–84, $\geq$85). Bars show subgroup performance on four metrics. Lines (right axis) show fairness dispersion (standard deviation and max–min gap across subgroups). AD-CARE delivers both high diagnostic performance and lower variability across demographic groups compared with baseline methods, indicating improved robustness and fairness across race and age.
  • Figure 4: AD-CARE assistance improves clinicians’ diagnostic accuracy and efficiency. (a), Diagnostic performance of neurologists and radiologists with and without agent assistance, stratified by seniority level. Points denote mean performance estimates for doctor-only reads and doctor-plus-agent reads, and error bars indicate 95% confidence intervals obtained by bootstrap resampling. Metrics include accuracy, F1 score, sensitivity and specificity. Across both specialties and experience levels, access to the agent yields consistent gains in all metrics. (b), Effect of AD-CARE assistance on per-case reading time. Violin plots depict the distribution of decision times for unaided clinicians (blue) and agent-assisted clinicians (orange) overall and within each subgroup and site (SYSUH neurologists, XWH radiologists). Boxes summarize median and mean times, and inset annotations report mean ± 95% CI and the corresponding efficiency gain (ratio of unaided to assisted time). AD-CARE assistance substantially shortens reading time for both neurologists and radiologists while preserving or improving diagnostic performance.
  • Figure 5: Benchmark comparison of AD-CARE with raw LLM backbones and cost–accuracy trade-off analysis. (a), Accuracy of standalone LLM backbones versus the corresponding LLM-powered AD-CARE system. Numbers indicate the absolute accuracy gain achieved by AD-CARE over the raw LLM output for each backbone. Across all eight models, AD-CARE consistently improves diagnostic accuracy, demonstrating the effectiveness of the framework. (b), AD-CARE accuracy versus overall inference cost for each instantiated backbone. Point colors denote the LLM provider. The dashed line denotes the Pareto frontier of non-dominated backbones, for which no alternative achieves higher accuracy at lower cost. Bubble diameter is proportional to the relative improvement ratio.
  • ...and 1 more figures
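
Two quantities named in the captions above, the fairness dispersion in Figure 3 (standard deviation and max–min gap across subgroups) and the Pareto frontier in Figure 5, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the use of the population standard deviation, the weak-dominance rule, and all numbers are assumptions made for illustration only; none of the values come from the paper.

```python
from statistics import pstdev


def dispersion(subgroup_scores):
    """Return (standard deviation, max-min gap) of one metric across subgroups.

    Assumption: population standard deviation; the captions do not say
    whether the paper uses the population or the sample form.
    """
    values = list(subgroup_scores.values())
    return pstdev(values), max(values) - min(values)


def pareto_frontier(points):
    """Return the names of non-dominated (cost, accuracy) points.

    Assumption: a point is dominated when another point has cost no higher
    and accuracy no lower, with at least one of the two strictly better.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for other, c, a in points
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical accuracy per racial subgroup (made-up values).
scores = {"Asian": 0.80, "Black": 0.70, "White": 0.90}
std, gap = dispersion(scores)

# Hypothetical (backbone, inference cost, accuracy) triples (made-up values).
backbones = [
    ("A", 1.0, 0.80),
    ("B", 2.0, 0.85),
    ("C", 3.0, 0.83),  # dominated by B: higher cost, lower accuracy
    ("D", 0.5, 0.70),
]
frontier = pareto_frontier(backbones)
```

With these made-up inputs, `std` is about 0.082, `gap` is about 0.20, and `frontier` contains A, B, and D. A lower dispersion across subgroups corresponds to the "lower variability" reading of Figure 3, and the frontier corresponds to the dashed line in Figure 5(b).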