Extracting Social Determinants of Health from Pediatric Patient Notes Using Large Language Models: Novel Corpus and Methods

Yujuan Fu, Giridhar Kaushik Ramachandran, Nicholas J Dobbins, Namu Park, Michael Leu, Abby R. Rosenberg, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen

TL;DR

This study addresses the extraction of pediatric social determinants of health (SDoH) from clinical notes by introducing PedSHAC, an annotated corpus of social history sections from 1,260 clinical notes covering ten SDoH event types. It evaluates fine-tuned transformer models (mSpERT, T5) and in-context learning with GPT-4, using one-step and two-step extraction strategies. The fine-tuned extractors reach 78.4 F1 for event arguments, and GPT-4 with in-context learning reaches 82.3 F1 for event triggers, against an overall annotator agreement of 81.9 F1, which could support integration of SDoH into structured EHR data. The work highlights trade-offs between fine-tuning and prompt-based methods, underscores the potential for real-time SDoH extraction in pediatrics, and outlines avenues for expanding the corpus and refining prompts and evaluation.

Abstract

Social determinants of health (SDoH) play a critical role in shaping health outcomes, particularly in pediatric populations where interventions can have long-term implications. SDoH are frequently studied in the Electronic Health Record (EHR), which provides a rich repository for diverse patient data. In this work, we present a novel annotated corpus, the Pediatric Social History Annotation Corpus (PedSHAC), and evaluate the automatic extraction of detailed SDoH representations using fine-tuned and in-context learning methods with Large Language Models (LLMs). PedSHAC comprises annotated social history sections from 1,260 clinical notes obtained from pediatric patients within the University of Washington (UW) hospital system. Employing an event-based annotation scheme, PedSHAC captures ten distinct health determinants to encompass living and economic stability, prior trauma, education access, substance use history, and mental health, with an overall annotator agreement of 81.9 F1. Our proposed fine-tuned LLM-based extractors achieve high performance at 78.4 F1 for event arguments. In-context learning approaches with GPT-4 demonstrate promise for reliable SDoH extraction with limited annotated examples, with extraction performance at 82.3 F1 for event triggers.

Paper Structure (19 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: An annotation example. Triggers are in boldface; the box above each trigger shows the event type, arguments, and subtype labels.
  • Figure 2: Our one-step (T5-Event) and two-step (T5-2sQA) extraction models. T5-Event extracts all SDoH events, including triggers and arguments, in one query. T5-2sQA extracts triggers and arguments in separate queries, where Step Two includes the predicted triggers from Step One.
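The difference between the one-step and two-step designs in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `Event` class, the query wording, the `Living` event type, the `Type` argument label, and the stub models are all hypothetical stand-ins for a trained T5 model and the actual PedSHAC label set. The sketch only shows the control flow, in which Step Two queries are built from the triggers predicted in Step One.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Event:
    """One SDoH event: an event type, a trigger span, and labeled arguments."""
    event_type: str
    trigger: str
    arguments: Dict[str, str] = field(default_factory=dict)


def one_step(note: str, model: Callable[[str], List[Event]]) -> List[Event]:
    # One query returns triggers and arguments together.
    return model(f"Extract all SDoH events with triggers and arguments: {note}")


def two_step(
    note: str,
    trigger_model: Callable[[str], List[Tuple[str, str]]],
    argument_model: Callable[[str], Dict[str, str]],
) -> List[Event]:
    # Step One: predict (event_type, trigger) pairs only.
    triggers = trigger_model(f"Extract SDoH triggers: {note}")
    events = []
    for event_type, trigger in triggers:
        # Step Two: the query includes the trigger predicted in Step One.
        query = f"Extract arguments for {event_type} trigger '{trigger}': {note}"
        events.append(Event(event_type, trigger, argument_model(query)))
    return events


# Hypothetical stubs standing in for trained models.
def stub_trigger_model(query: str) -> List[Tuple[str, str]]:
    return [("Living", "lives with")]


def stub_argument_model(query: str) -> Dict[str, str]:
    # Returns arguments only if the Step One trigger reached the Step Two query.
    return {"Type": "with family"} if "lives with" in query else {}


def stub_event_model(query: str) -> List[Event]:
    return [Event("Living", "lives with", {"Type": "with family"})]


note = "Patient lives with mother and two siblings."
one_step_events = one_step(note, stub_event_model)
two_step_events = two_step(note, stub_trigger_model, stub_argument_model)
```

With these stubs both paths yield the same single event. In a real system the two-step path can propagate Step One errors, since a missed trigger never receives an argument query.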