Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data
Yu Leng, Yingnan He, Colin Magdamo, Ana-Maria Vranceanu, Christine S. Ritchie, Shibani S. Mukerji, Lidia M. V. R. Moura, John R. Dickson, Deborah Blacker, Sudeshna Das
TL;DR
This work investigates GPT-4o's ability to identify stages of cognitive impairment from unstructured electronic health records. Using two real-world datasets (MGH memory-clinic notes for global CDR scoring, and notes from a 3-year window for Medicare patients for syndromic staging as NC, MCI, or dementia), the authors evaluate zero-shot GPT-4o with and without retrieval augmentation and prompt-engineering techniques. The model achieves high agreement with clinician labels: a weighted kappa of 0.83 on the memory-clinic task and 0.91 on the Medicare task (0.96 on cases adjudicated with high confidence), indicating strong potential for scalable chart review and clinical support. However, the study also highlights biases in documentation and access, underscoring the need for multi-institution validation and bias-mitigation strategies before deployment in routine care.
Abstract
Identifying cognitive impairment within electronic health records (EHRs) is crucial not only for timely diagnoses but also for facilitating research. Information about cognitive impairment often exists within unstructured clinician notes in EHRs, but manual chart reviews are both time-consuming and error-prone. To address this issue, our study evaluates an automated approach using zero-shot GPT-4o to determine the stage of cognitive impairment in two different tasks. First, we evaluated the ability of GPT-4o to determine the global Clinical Dementia Rating (CDR) from specialist notes for 769 patients who visited the memory clinic at Massachusetts General Hospital (MGH); it achieved a weighted kappa score of 0.83. Second, we assessed GPT-4o's ability to differentiate between normal cognition, mild cognitive impairment (MCI), and dementia using all notes in a 3-year window for 860 Medicare patients. GPT-4o attained a weighted kappa score of 0.91 in comparison to specialist chart reviews and 0.96 on cases that the clinical adjudicators rated with high confidence. Our findings demonstrate GPT-4o's potential as a scalable chart review tool for creating research datasets and, in the future, assisting diagnosis in clinical settings.
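Both results above are reported as weighted kappa, an agreement statistic for ordered categories that penalizes a disagreement more the further apart the two labels are. The following is a minimal sketch of how such a score can be computed, not the authors' evaluation code. The abstract does not state which weighting scheme was used, so linear weights are an assumption here (quadratic is offered as an alternative); the function name `weighted_kappa` and the toy labels are hypothetical and purely illustrative.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories.

    `categories` lists the labels from least to most severe.
    `weighting` is "linear" or "quadratic" (an assumption; the source
    does not say which scheme the reported scores use).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must provide the same non-zero number of labels")
    if len(categories) < 2:
        raise ValueError("need at least two ordered categories")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")

    k = len(categories)
    n = len(rater_a)
    index = {label: i for i, label in enumerate(categories)}
    power = 1 if weighting == "linear" else 2

    def disagreement(i, j):
        # 0 on the diagonal, 1 for the two most distant categories.
        return (abs(i - j) / (k - 1)) ** power

    # Observed joint counts and each rater's marginal counts.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    observed = sum(disagreement(i, j) * c for (i, j), c in joint.items()) / n
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)

    if expected == 0:
        # Both raters used one identical category; kappa is undefined.
        return float("nan")
    return 1.0 - observed / expected


# Toy labels for illustration only; these are not data from the study.
STAGES = ["NC", "MCI", "dementia"]
clinician = ["NC", "MCI", "dementia", "MCI", "NC", "dementia"]
model = ["NC", "MCI", "dementia", "dementia", "NC", "MCI"]
print(weighted_kappa(clinician, model, STAGES))
```

A score of 1 indicates perfect agreement and 0 indicates agreement no better than chance given each rater's label frequencies. If scikit-learn is available, `sklearn.metrics.cohen_kappa_score` with its `weights` argument provides the same statistic.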
