High Throughput Phenotyping of Physician Notes with Large Language and Hybrid NLP Models

Syed I. Munzir, Daniel B. Hier, Michael D. Carrithers

TL;DR

The paper addresses the need for high-throughput deep phenotyping of physician notes in electronic health records to support precision medicine. It compares NimbleMiner, a hybrid NLP method using word embeddings and a support vector machine, with GPT-4, a general-purpose large language model, on phenotyping 547 multiple sclerosis notes across 19 neurological phenotype categories, with ground-truth labels created in Prodigy. Both approaches achieve high accuracy (0.87 for NimbleMiner and 0.85 for GPT-4), close to the human inter-annotator agreement ceiling (κ ≈ 0.90). GPT-4 is easy to configure and needs no training data, while NimbleMiner offers transparency and, with proper lexicon design, fast recall. The findings suggest that LLMs may become the dominant method for high-throughput deep phenotyping of clinical notes, though broader validation on diverse corpora is needed to assess generalizability and computational costs.
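The accuracy and agreement figures quoted above can be made concrete with a small calculation. The following is a minimal sketch, not the authors' evaluation code: it assumes each phenotype decision is a binary label (1 = present, 0 = absent), and the function names and sample labels are hypothetical. How the paper itself computed its accuracy and κ values is not stated in this summary.

```python
from typing import Sequence


def accuracy(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of binary labels on which the two label lists agree."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters assigning binary (0/1) labels."""
    n = len(a)
    p_observed = accuracy(a, b)
    # Chance agreement: both raters say 1, plus both raters say 0.
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    p_expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_expected == 1.0:
        # Degenerate case: both raters used the same single label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for one phenotype across eight notes.
truth = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 0, 1, 1]
acc = accuracy(truth, pred)
kappa = cohen_kappa(truth, pred)
```

Accuracy counts raw agreement, while kappa discounts the agreement expected by chance, which is why the two numbers are not directly interchangeable.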

Abstract

Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high throughput methods. Over the past thirty years, progress has been made toward making high throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for high throughput deep phenotyping of physician notes.

Paper Structure (4 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Bar chart showing the frequency of each sign or symptom in the test set of patient notes. All patient notes had a diagnosis of multiple sclerosis (ICD-10 code G35). Each sign or symptom was coded binary (present or absent) regardless of the number of occurrences in the note. The most common signs and symptoms were weakness, paresthesias, pain, and gait.
  • Figure 2: Annotation screen for Prodigy (Explosion AI) using the manual.spancat recipe to label text spans. The annotator has a choice of 19 neurological phenotype labels and has chosen "paresthesias" to label the text span "burning sensation in their feet".
  • Figure 3: Simclin Explorer screen for NimbleMiner for the annotation of the feature "behavior". The blue highlighted row indicates a simclin selected for inclusion in the final simclins list. The gray rows represent terms that were marked irrelevant.
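
The binary coding described in the Figure 1 caption (a sign or symptom counts once per note, however often it is mentioned) can be sketched as follows. This is an illustrative assumption about the data layout, not the authors' code: each note is represented as a list of phenotype labels, one per annotated span, and the function names and sample notes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def binary_code(note_labels: Iterable[str]) -> Set[str]:
    """Collapse repeated labels within one note to present/absent."""
    return set(note_labels)


def phenotype_frequencies(notes: List[List[str]]) -> Dict[str, int]:
    """Count how many notes contain each label at least once."""
    counts: Counter = Counter()
    for labels in notes:
        counts.update(binary_code(labels))
    return dict(counts)


# Hypothetical notes; repeated labels within a note count only once.
notes = [
    ["weakness", "weakness", "pain"],
    ["paresthesias", "weakness"],
    ["gait", "pain", "pain"],
]
freq = phenotype_frequencies(notes)
```

Counting notes rather than mentions keeps a single long note with many repeated mentions from dominating the frequency chart.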