ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias

Guangxin Zhao, Jiahao Zheng, Malaz Boustani, Jarek Nabrzyski, Meng Jiang, Yiyu Shi, Zhi Zheng

TL;DR

ADRD-Bench is the first ADRD-focused LLM evaluation dataset. It combines 1,352 clinical knowledge questions drawn from seven existing medical benchmarks with 149 caregiving questions derived from the Aging Brain Care (ABC) program, so that it assesses both clinical knowledge and practical caregiving reasoning. The study benchmarks 33 LLMs spanning open-weight general, open-weight medical, and closed-source general models, and reports effects of model size and domain tuning: larger models generally perform better, and domain-focused models excel at clinical QA, whereas caregiving tasks benefit from broad commonsense ability. Accuracy on the clinical knowledge set often tracks accuracy on the caregiving set, but the correlation is robust only for open-weight medical models. Case analyses uncover systematic failures such as overgeneralization and misinterpretation of disengagement cues, underscoring the need for domain-specific alignment. The dataset and findings are intended as a stepping stone toward targeted evaluation, community-driven expansion, and safer ADRD AI in real-world care.

Abstract

Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.
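The per-category figures quoted in the abstract (range, mean, std) can be reproduced from a list of per-model accuracies. The following is a minimal sketch, not the authors' evaluation code: the `summarize` helper and the accuracy values are hypothetical placeholders, and the use of the sample standard deviation is an assumption, since the abstract does not say which variant was reported.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Return (min, max, mean, sample std) for a list of per-model accuracies."""
    # Assumption: sample standard deviation; the source does not specify the variant.
    return (min(accuracies), max(accuracies), mean(accuracies), stdev(accuracies))


# Placeholder values for illustration only; these are NOT results from the paper.
example_accuracies = {
    "open-weight general": [0.60, 0.70, 0.80],
    "open-weight medical": [0.50, 0.75, 1.00],
}

for category, accs in example_accuracies.items():
    lo, hi, avg, sd = summarize(accs)
    print(f"{category}: range {lo:.2f} to {hi:.2f} (mean: {avg:.2f}; std: {sd:.2f})")
```

With real data, each list would hold one accuracy value per evaluated model in that category.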

Paper Structure (21 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Model accuracy vs. model size on the ADRD Unified QA. Blue points: open-weight general models; Red points: open-weight medical models; Green points: closed-source general models. Dashed horizontal line: mean accuracy across all models.
  • Figure 2: Model accuracy vs. model size on the ADRD Caregiving QA. Blue points: open-weight general models; Red points: open-weight medical models; Green points: closed-source general models. Dashed horizontal line: mean accuracy across all models.
  • Figure 3: Correlation between accuracy on ADRD Caregiving QA and accuracy on ADRD Unified QA across all models; larger points indicate models with more parameters.
  • Figure 4: Correlation between accuracy on ADRD Caregiving QA and accuracy on ADRD Unified QA for (a) general models; (b) medical models; (c) closed-source models.
  • Figure 5: Example of a True/False question from ADRD Caregiving QA that most LLMs answered incorrectly.
  • ...and 3 more figures
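
The correlation plots in Figures 3 and 4 compare each model's accuracy on the two question sets. As a rough sketch of how such a relationship could be quantified, the snippet below computes a Pearson correlation coefficient. The choice of Pearson is an assumption (the captions do not name the statistic), the `pearson` helper is hypothetical, and the accuracy values are placeholders rather than results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder per-model accuracies, for illustration only (not from the paper).
unified_qa = [0.60, 0.70, 0.80, 0.90]
caregiving_qa = [0.65, 0.70, 0.85, 0.88]

r = pearson(unified_qa, caregiving_qa)
print(f"Pearson r between Unified QA and Caregiving QA accuracy: {r:.2f}")
```

Running the same computation separately on each model category would mirror the per-panel view in Figure 4.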