Leveraging Pre-trained and Transformer-derived Embeddings from EHRs to Characterize Heterogeneity Across Alzheimer's Disease and Related Dementias

Matthew West, Colin Magdamo, Lily Cheng, Yingnan He, Sudeshna Das

TL;DR

This study addresses heterogeneity in Alzheimer's disease and related dementias (ADRD) by applying unsupervised learning to electronic health records. It uses two representation streams: pre-trained ICD-code embeddings and transformer-derived Clinical BERT embeddings of neurology notes, clustering patient vectors with Ward's hierarchical method. The analysis reveals three latent subtypes per modality, with interpretable enrichment in comorbidities (e.g., chronic pain and vascular disease) and textual cues (e.g., hydrocephalus, neuropsychiatric terms), implying heterogeneity beyond AD coding alone. Limitations include lack of external validation and reliance on billing codes; the authors suggest joint multi-modal and temporally-aware approaches to better capture disease trajectories and etiologies.

Abstract

Alzheimer's disease is a progressive, debilitating neurodegenerative disease that affects 50 million people globally. Despite this substantial health burden, available treatments for the disease are limited and its fundamental causes remain poorly understood. Previous work has suggested the existence of clinically meaningful sub-types, which may correspond to distinct etiologies, disease courses, and ultimately appropriate treatments. Here, we use unsupervised learning techniques on electronic health records (EHRs) from a cohort of memory disorder patients to characterise heterogeneity in this disease population. Patient EHRs are encoded using pre-trained embeddings for medical codes and transformer-derived Clinical BERT embeddings of free text. We identify sub-populations on the basis of comorbidities and shared textual features, and discuss their clinical significance.
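
The clustering step described in the summary above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each patient vector is the plain average of that patient's code embeddings (the pooling rule is not stated here), uses SciPy's Ward linkage, and runs on synthetic two-dimensional data. The function names `patient_vectors` and `ward_clusters`, the toy embedding values, and the choice of example codes are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def patient_vectors(patient_codes, code_embeddings):
    """Average the embedding vectors of each patient's codes.

    Mean pooling is an assumption made for this sketch, not a detail
    taken from the paper.
    """
    return np.vstack([
        np.mean([code_embeddings[code] for code in codes], axis=0)
        for codes in patient_codes
    ])


def ward_clusters(vectors, n_clusters=3):
    """Build a Ward linkage tree and cut it into n_clusters flat clusters."""
    tree = linkage(vectors, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy embeddings for two ICD9 codes (values are made up for illustration).
toy_embeddings = {
    "331.0": np.array([1.0, 0.0]),
    "401.9": np.array([0.0, 1.0]),
}
pooled = patient_vectors([["331.0", "401.9"], ["331.0"]], toy_embeddings)

# Synthetic cohort: three well-separated groups of 20 "patients" each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_cohort = np.vstack(
    [centre + rng.normal(scale=0.5, size=(20, 2)) for centre in centres]
)
labels = ward_clusters(toy_cohort, n_clusters=3)
```

Cutting the tree with `criterion="maxclust"` fixes the number of clusters in advance, which mirrors the three subtypes per modality reported in the summary; how the authors actually selected that number is not described in this section.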

Paper Structure

This paper contains 12 sections, 4 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Cohort selection process. A flowchart showing the selection process for our patient cohort.
  • Figure 2: Notes preprocessing pipeline visualization. A schematic diagram showing the note processing pipeline for a 512-token input sequence to Clinical BERT.
  • Figure 3: UMAP projections of patient cohort across both representations. UMAP projections of the ICD representation coloured by cluster membership (a) and by whether patients have an Alzheimer's ICD9 code (b), as well as the corresponding projections for the patient notes representation (c, d).
  • Figure 4: ICD clustering heatmap. A heatmap showing ICD9 code enrichment for clusters found in the ICD clustering. The colour of each code-cluster pair is determined by the normalized Z-score for that code.
  • Figure 5: Notes clustering heatmap. A heatmap showing ICD9 code enrichment for clusters found in the patient notes clustering. The colour of each code-cluster pair is determined by the normalized Z-score for that code.
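
The heatmap captions for Figures 4 and 5 refer to a normalized Z-score per code-cluster pair without giving the formula. One plausible reading, sketched below purely as an assumption, is to compute the prevalence of each code within each cluster and then standardize those prevalences across clusters for each code. The function name `code_enrichment_z` and the toy data are hypothetical, and the paper's actual normalization may differ.

```python
import numpy as np


def code_enrichment_z(has_code, labels):
    """Z-score of per-cluster code prevalence, standardized per code.

    has_code: (n_patients, n_codes) binary matrix, 1 if the patient has the code.
    labels:   (n_patients,) cluster label for each patient.
    Returns the sorted cluster labels and an (n_clusters, n_codes) matrix.
    This definition is an assumed reading of "normalized Z-score".
    """
    has_code = np.asarray(has_code, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Fraction of patients in each cluster carrying each code.
    prevalence = np.vstack(
        [has_code[labels == k].mean(axis=0) for k in clusters]
    )
    mean = prevalence.mean(axis=0)
    std = prevalence.std(axis=0)
    std[std == 0] = 1.0  # codes with identical prevalence everywhere get z = 0
    return clusters, (prevalence - mean) / std


# Toy cohort: 6 patients, 2 codes, 3 clusters.
toy_has_code = [[1, 0], [1, 0], [0, 0], [0, 0], [0, 1], [0, 1]]
toy_labels = [1, 1, 2, 2, 3, 3]
cluster_ids, z_scores = code_enrichment_z(toy_has_code, toy_labels)
```

In this toy example the first code is enriched only in cluster 1 and the second only in cluster 3, so each column of `z_scores` has a single positive entry; plotting the matrix with one row per cluster would give a heatmap in the style the captions describe.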