Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model

Salman Rahman; Lavender Yao Jiang; Saadia Gabriel; Yindalon Aphinyanaphongs; Eric Karl Oermann; Rumi Chunara

TL;DR

This study evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, on 30-day all-cause readmission prediction, focusing on how its performance varies across hospitals and patient characteristics. It also compared local fine-tuning, instance-based augmented fine-tuning, and cluster-based fine-tuning as ways to improve generalization.

Abstract

Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models depends critically on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand the reasons for these challenges and inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, and analyzed its performance on 30-day all-cause readmission prediction, focusing on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand the reasons for this lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission, and mortality rates). We used descriptive statistics and supervised classification to identify features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local fine-tuning (hospital-specific), instance-based augmented fine-tuning, and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (most helpful in settings with limited data). Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare, and improving their performance for broader populations.
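
The subgroup analysis described in the abstract (comparing AUC across hospitals and patient characteristics) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The record fields (`hospital`, `readmitted`, `score`), the helper names, and the toy values are all assumptions made for illustration. AUC is computed here directly from its pairwise definition: the probability that a randomly chosen readmitted patient receives a higher score than a randomly chosen non-readmitted one.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    Returns None when only one class is present, where AUC is undefined.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(records, group_key):
    """Split records by `group_key` and compute AUC within each group."""
    grouped = defaultdict(lambda: ([], []))
    for r in records:
        labels, scores = grouped[r[group_key]]
        labels.append(r["readmitted"])  # 1 = readmitted within 30 days
        scores.append(r["score"])       # model's predicted risk
    return {g: auc(ls, ss) for g, (ls, ss) in grouped.items()}


# Hypothetical toy records; these are NOT data from the study.
records = [
    {"hospital": "Hospital 1", "readmitted": 1, "score": 0.9},
    {"hospital": "Hospital 1", "readmitted": 0, "score": 0.2},
    {"hospital": "Hospital 1", "readmitted": 1, "score": 0.7},
    {"hospital": "Hospital 1", "readmitted": 0, "score": 0.4},
    {"hospital": "Hospital 3", "readmitted": 1, "score": 0.3},
    {"hospital": "Hospital 3", "readmitted": 0, "score": 0.6},
    {"hospital": "Hospital 3", "readmitted": 1, "score": 0.8},
    {"hospital": "Hospital 3", "readmitted": 0, "score": 0.1},
]

per_hospital = auc_by_group(records, "hospital")
print(per_hospital)
```

The same helper could be pointed at any other grouping field, such as an insurance type or age band, provided each record carries that key. The pairwise computation is quadratic in group size, so a rank-based implementation from an established library would be preferable on real data.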

Paper Structure (41 sections, 2 equations, 3 figures, 2 tables)

Introduction
Related Work
Generalization in Clinical Large Language Models (LLMs)
Challenges to Generalization in Large Language Models
Improving Generalization in Large Language Models
Methods: Pretraining, Fine-Tuning, Implementation, Clustering and Analysis Details
Clinical Prediction Task
Pretraining
Fine-tuning
Global Fine-tuning
Local Fine-tuning (Hospital Specific)
Instance-Based Augmented Fine-tuning
Cluster-Based Fine-tuning
Population and Hospital Groups for Generalizability Assessment
Hospital Group
...and 26 more sections

Figures (3)

Figure 1: Workflow for instance-based matching, which identifies patient matches in low-data hospitals (Hospital 3 and Hospital 4) [left], and cluster-based matching, which groups patients with similar characteristics [right].
Figure 2: Comparative AUC analysis for Hospital 1 and Hospital 2 in the random and temporal tests (solid and dotted lines, respectively). With an equal number of fine-tuning samples, Hospital 1 generalizes better than Hospital 2. The analysis was run five times for each sample size using a random seed. The standard deviation was almost zero at every sample size, except for a standard deviation of $0.02$ in both the Hospital 1 random and temporal tests at a sample size of $10^2$.
Figure 3: Visualization and report of important factors distinguishing clusters of similar notes via patient, hospital, and note-level characteristics. The distinguishing factors include hospital-level features like readmission and mortality rates, as well as patient and note-level features: mean comorbidities, age of patient, and number of words in the notes (the top three features in the decision tree).
