Leveraging Open-Source Large Language Models for encoding Social Determinants of Health using an Intelligent Router

Akul Goel, Surya Narayanan Hari, Belinda Waltman, Matt Thomson

TL;DR

This paper introduces an intelligent routing system for SDOH coding that uses a language model router to direct medical record data to the open-source LLMs that perform best on specific SDOH codes.

Abstract

Social Determinants of Health (SDOH), also known as Health-Related Social Needs (HRSN), play a significant role in patient health outcomes. The Centers for Disease Control and Prevention (CDC) introduced a subset of ICD-10 codes called Z-codes to recognize and measure SDOH. However, Z-codes are infrequently recorded in a patient's Electronic Health Record (EHR) and in many cases must instead be inferred from clinical notes. Previous research has shown that large language models (LLMs) are promising for extracting information from unstructured EHR data, but it can be difficult to identify a single model that performs best across varied coding tasks. Further, clinical notes contain protected health information, which poses a challenge for the use of closed-source language models from commercial vendors. Identifying open-source LLMs that can be run within health organizations and that perform well on SDOH tasks is therefore an important problem. Here, we introduce an intelligent routing system for SDOH coding that uses a language model router to direct medical record data to open-source LLMs that demonstrate optimal performance on specific SDOH codes. This intelligent routing system achieves state-of-the-art performance of 96.4% accuracy averaged across 13 codes, including homelessness and food insecurity, outperforming closed models such as GPT-4o. We leveraged a publicly available, deidentified dataset of medical record notes to run the router, and we also introduce a synthetic data generation and validation paradigm to increase the scale of training data without needing privacy-protected medical records. Together, we demonstrate an architecture for intelligent routing of inputs to task-optimal language models to achieve high performance across a set of medical coding sub-tasks.
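The routing idea in the abstract can be illustrated with a minimal sketch: given per-code validation accuracies for several candidate models, pick the best model for each SDOH code and send inputs for that code to it. This is not the authors' implementation. The model names (`model_a`, `model_b`), the accuracy values, and the function names are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict

# Hypothetical validation accuracies, indexed as ACCURACY[sdoh_code][model_name].
# These numbers are illustrative placeholders, not results from the paper.
ACCURACY: Dict[str, Dict[str, float]] = {
    "homelessness": {"model_a": 0.91, "model_b": 0.95},
    "food insecurity": {"model_a": 0.97, "model_b": 0.90},
}


def build_routing_table(accuracy: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each SDOH code to the model with the highest validation accuracy."""
    return {code: max(scores, key=scores.get) for code, scores in accuracy.items()}


def route(code: str, table: Dict[str, str]) -> str:
    """Return the model chosen for a given SDOH code."""
    if code not in table:
        raise KeyError(f"No model has been evaluated for SDOH code: {code!r}")
    return table[code]


def routed_mean_accuracy(
    accuracy: Dict[str, Dict[str, float]], table: Dict[str, str]
) -> float:
    """Average, across codes, of the accuracy of the model each code is routed to."""
    return sum(accuracy[code][table[code]] for code in table) / len(table)


ROUTING_TABLE = build_routing_table(ACCURACY)
```

Because each code is routed to its own best model, the routed average can never be lower than the average of any single model evaluated on the same codes, which is the intuition behind comparing a router against one fixed model.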

Paper Structure

This paper contains 4 sections, 1 equation, 12 figures.

Figures (12)

  • Figure 1: Models were prompted with a sentence from the medical note to classify. a) The above figure shows the prompt template used to prompt all models on all SDOH codes. The specific SDOH we were aiming to classify was input into the SDOH keyword or phrase (green) field, and a sentence from the medical note the model was to classify was placed in the SENTENCE (blue) field. b) This figure shows a specific example of a prompt for SDOH code 'imprisonment or other incarceration' (green), with a sentence (blue) from the medical note corresponding to it.
  • Figure 2: Claude-3-opus generated and verified synthetic data. The figure shows a flow chart of synthetic data generation and verification by Claude-3-opus. Random sentences corresponding to a specific SDOH code were given as examples from labeled medical notes. Based on these examples, Claude was asked to generate synthetic sentences via two different prompts, one of which explicitly asked the model not to use the SDOH keyword, to increase complexity. Each synthetic sentence was subsequently checked by Claude for evidence of the specific SDOH code, and sentences that did not pass were discarded. Sentences that passed the verification were added to that SDOH code's synthetic data collection. Random example sentences were picked from 500 labeled medical notes from the MIMIC-III dataset [johnson_mimic-iii_2015].
  • Figure 3: Test Set displayed approximately 33%/67% splits for each code. The table shows the data distribution across 5 SDOH codes, with various numbers of gold data (sentences from medical notes containing the SDOH code), synthetic data (sentences containing the SDOH code, generated and validated by Claude as in Fig. 2), and negative data (a random subset of sentences from medical notes not containing the SDOH code). This validation set ultimately contained approximately 33% positive labels (defined as gold + synthetic data) and 67% negative labels (negative data) for each SDOH task, with around 1000 total sentences for each code. Gold data and negative data were sentences pulled from 500 labeled medical notes from the MIMIC-III dataset [johnson_mimic-iii_2015].
  • Figure 4: Verification scheme resulted in Claude dropping synthetic data. Table for 5 SDOH codes showing the number of synthetic sentences dropped by Claude in the scheme shown in Fig. 2. The data distribution after the “Claude Dropped” step is shown in Fig. 3 and was the dataset used to generate the comparison plot figure.
  • Figure 5: Router identifies best model for given SDOH code. Router model architecture. The oracle router takes in the keyword or phrase for a given SDOH code and outputs the best model for the task. The figure shows a color-coded arrow from each SDOH code to the model chosen by the router (i.e., for food insecurity, Zero-one-ai/Yi-34B-Chat is considered the best model).
  • ...and 7 more figures
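
The generate-then-verify flow described in the Figure 2 caption can be sketched as a simple filter loop. This is a hedged illustration and not the authors' code: `generate` and `verify` are hypothetical callables standing in for the two LLM calls, and the toy stand-ins below use hard-coded sentences and a keyword check purely so the example runs without any model access.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures for the two LLM-backed steps (assumptions, not a real API):
#   generate(code, examples, avoid_keyword) -> candidate synthetic sentences
#   verify(code, sentence) -> True if the sentence shows evidence of the code
Generate = Callable[[str, List[str], bool], List[str]]
Verify = Callable[[str, str], bool]


def build_synthetic_set(
    code: str, examples: List[str], generate: Generate, verify: Verify
) -> Tuple[List[str], List[str]]:
    """Run both prompt variants, then keep only sentences that pass verification."""
    kept: List[str] = []
    dropped: List[str] = []
    # Two prompt variants: one plain, one asked not to use the SDOH keyword.
    for avoid_keyword in (False, True):
        for sentence in generate(code, examples, avoid_keyword):
            if verify(code, sentence):
                kept.append(sentence)
            else:
                dropped.append(sentence)
    return kept, dropped


def toy_generate(code: str, examples: List[str], avoid_keyword: bool) -> List[str]:
    """Toy stand-in for the generation call; returns fixed sentences."""
    if avoid_keyword:
        return ["Patient has been sleeping in a shelter.", "Patient enjoys gardening."]
    return [f"Patient reports {code}.", "Patient denies pain."]


def toy_verify(code: str, sentence: str) -> bool:
    """Toy stand-in for the verification call; a crude cue-word check."""
    cues = (code, "shelter")
    return any(cue in sentence.lower() for cue in cues)


KEPT, DROPPED = build_synthetic_set(
    "homelessness", ["Pt is currently homeless."], toy_generate, toy_verify
)
```

Tracking the dropped sentences alongside the kept ones mirrors the bookkeeping implied by the Figure 4 caption, where the number of synthetic sentences discarded during verification is reported per code.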