ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias
Guangxin Zhao, Jiahao Zheng, Malaz Boustani, Jarek Nabrzyski, Meng Jiang, Yiyu Shi, Zhi Zheng
TL;DR
ADRD-Bench introduces the first ADRD-focused LLM evaluation dataset. It combines 1,352 knowledge questions consolidated from seven established medical benchmarks with 149 caregiving questions derived from the Aging Brain Care (ABC) program, so it assesses both clinical knowledge and practical caregiving reasoning. The study benchmarks 33 LLMs spanning open-weight and closed-source models and reports size and domain-tuning effects: larger models generally perform better, and domain-focused models excel at clinical QA, whereas caregiving tasks benefit from broad commonsense ability. Correlations across the QA sets show that clinical-knowledge performance often aligns with caregiving performance, but the relationship is robust only for open-weight medical models. Case analyses uncover systematic failures such as overgeneralization and misinterpretation of disengagement cues, underscoring the need for domain-specific alignment and for safer, more realistic ADRD AI systems. The dataset and findings offer a stepping stone for targeted evaluation and community-driven expansion toward ADRD AI that is safe to use in real-world care.
Abstract
Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.
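
The per-category figures above (range, mean, and standard deviation of accuracy) can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `accuracy` and `summarize` are hypothetical, multiple-choice answers are assumed to be compared by exact match, and the sample standard deviation is assumed because the abstract does not say which estimator was used.

```python
from statistics import mean, stdev


def accuracy(predictions, answers):
    """Fraction of questions whose predicted option exactly matches the key.

    Assumption: answers are short option labels such as "A" or "B".
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def summarize(accuracies_by_category):
    """Return min, max, mean, and sample std of model accuracies per category.

    `accuracies_by_category` maps a category name (for example
    "open-weight medical") to a list with one accuracy per model.
    """
    summary = {}
    for category, scores in accuracies_by_category.items():
        summary[category] = {
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            # A single model has no spread; report 0.0 instead of raising.
            "std": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary
```

In use, `accuracy` would be computed once per model on each QA set, and the resulting values grouped by model category before being passed to `summarize`.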
