Large Language Models to Identify Social Determinants of Health in Electronic Health Records

Marco Guevara; Shan Chen; Spencer Thomas; Tafadzwa L. Chaunzwa; Idalid Franco; Benjamin Kann; Shalini Moningi; Jack Qian; Madeleine Goldstein; Susan Harper; Hugo JWL Aerts; Guergana K. Savova; Raymond H. Mak; Danielle S. Bitterman

TL;DR

The best fine-tuned models outperformed ChatGPT-family models in zero- and few-shot settings, except GPT4 with 10-shot prompting on the adverse SDoH task.

Abstract

Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from the electronic health record (EHR). This study investigated the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text for improving the extraction of these scarcely documented, yet extremely valuable, clinical data. 800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated. The study also experimented with synthetic data generation and assessed for algorithmic bias. Our best-performing models were fine-tuned Flan-T5 XL for any SDoH mentions (macro-F1 0.71) and fine-tuned Flan-T5 XXL for adverse SDoH mentions (macro-F1 0.70). The benefit of augmenting fine-tuning with synthetic data varied across model architecture and size, with smaller Flan-T5 models (base and large) showing the greatest improvements in performance (delta F1 +0.12 to +0.23). Model performance was similar on the in-hospital system dataset but worse on the MIMIC-III dataset. Our best-performing fine-tuned models outperformed zero- and few-shot performance of ChatGPT-family models for both tasks. These fine-tuned models were less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p<0.05). At the patient level, our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our method can effectively extract SDoH information from clinic notes, performing better than GPT models in zero- and few-shot settings. These models could enhance real-world evidence on SDoH and aid in identifying patients needing social support.
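
The headline numbers above are macro-F1 scores, meaning the per-class F1 values are averaged with equal weight so that rare SDoH categories count as much as common ones. The sketch below illustrates that calculation under simplifying assumptions: it treats each sentence as having a single label, whereas the study's setup may differ, and the label names and example predictions are hypothetical, not taken from the paper's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        # F1 is defined as 0 when both precision and recall are 0.
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical sentence-level labels, for illustration only.
gold = ["HOUSING", "HOUSING", "NONE", "NONE"]
pred = ["HOUSING", "NONE", "NONE", "NONE"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

Because every class contributes equally to the average, a model that ignores an infrequent category is penalized more heavily than it would be under accuracy or micro-averaged F1.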

Paper Structure (28 sections, 6 figures, 13 tables)

Large Language Models to Identify Social Determinants of Health in Electronic Health Records
INTRODUCTION
MATERIALS AND METHODS
Data
Task definition and data labeling
Data augmentation
Synthetic test set generation
Model development
Ablation studies
Evaluation
ChatGPT-family model evaluation
Language model bias evaluation
Comparison with structured EHR data
RESULTS
Model performance
...and 13 more sections

Figures (6)

Figure 1: Illustration of generating and comparing synthetic demographic-injected SDoH language pairs to assess how adding race/ethnicity and gender information into a sentence may impact model performance. FT = fine-tuned.
Figure 2: Macro-F1 performance of Flan-T5 XL models fine-tuned using gold data only (orange line) and gold plus synthetic data (blue line), as gold-labeled sentences are progressively undersampled from the training dataset, for the (a) any social determinant of health (SDoH) mention task and (b) adverse SDoH mention task. The full gold-labeled training set comprises 29,869 sentences, augmented with 1,800 synthetic SDoH sentences.
Figure 3: Performance Comparison of Best Flan-T5 Model Against GPTs on Synthetic Test Set
Figure 4: The proportion of synthetic sentence pairs with and without demographics injected that led to a classification mismatch, meaning that the model predicted a different SDoH label for each sentence in the pair. Results are shown across race/ethnicity and gender for (a) adverse SDoH mention task and (b) any SDoH mention task. Asterisks indicate statistical significance ($P \leq 0.05$).
Figure 5 (Figure B1): Class-wise and Macro-F1 scores of our best-performing model against mapped Z-Codes at the patient level (on test set and dev set).
...and 1 more figures
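
The classification mismatch described in the Figure 4 caption can be read as a simple proportion: the share of sentence pairs for which the prediction on the original sentence differs from the prediction on its demographic-injected counterpart. A minimal sketch of that proportion follows. The function name and the example labels are hypothetical, and the statistical testing reported in the paper is not reproduced here.

```python
def mismatch_rate(preds_original, preds_injected):
    """Fraction of sentence pairs whose two predicted labels differ."""
    if len(preds_original) != len(preds_injected):
        raise ValueError("both prediction lists must have the same length")
    if not preds_original:
        raise ValueError("at least one sentence pair is required")
    mismatches = sum(
        1 for a, b in zip(preds_original, preds_injected) if a != b
    )
    return mismatches / len(preds_original)


# Hypothetical predictions for four sentence pairs, for illustration only.
without_demographics = ["HOUSING", "EMPLOYMENT", "NONE", "NONE"]
with_demographics = ["HOUSING", "NONE", "NONE", "EMPLOYMENT"]
print(mismatch_rate(without_demographics, with_demographics))
```

A lower rate means the model's output is more stable when race/ethnicity or gender descriptors are added to otherwise identical text.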
