LLM-Based Section Identifiers Excel on Open Source but Stumble in Real World Applications

Saranya Krishnamoorthy, Ayush Singh, Shabnam Tafreshi

TL;DR

This study investigates the feasibility of using large language models, notably GPT-4, to identify semantically relevant sections in electronic health records (EHRs), a task referred to here as section identification (SI). It shows that GPT-4 delivers near state-of-the-art performance on benchmark datasets (i2b2 2010 and MedSecId) in zero- and few-shot settings, but suffers a substantial performance drop on a harder real-world, OCR-noised dataset, highlighting the gap between clean benchmarks and practical EHRs. The authors provide a new real-world benchmark with a comprehensive taxonomy and analyze factors contributing to the drop, including header variability and data quality, while comparing LLMs to open-source alternatives. They conclude that unsupervised LLM approaches are powerful on clean data, but that robust real-world SI requires improved benchmarks, better data quality, and potentially synthetic data to bridge the gap. The work offers practical guidance for deploying SI in clinical NLP and contributes a taxonomy and evaluation framework for future research.

Abstract

Electronic health records (EHRs), though a boon for healthcare practitioners, are growing longer and more convoluted every day. Sifting through these lengthy EHRs is taxing and has become a cumbersome part of physician-patient interaction. Several approaches have been proposed to alleviate this issue, via either summarization or sectioning; however, only a few have truly been helpful. With the rise of automated methods, machine learning (ML) has shown promise in identifying relevant sections in EHRs. However, most ML methods rely on labeled data, which is difficult to obtain in healthcare. Large language models (LLMs), on the other hand, have performed impressive feats in natural language processing (NLP), even in a zero-shot manner, i.e., without any labeled data. To that end, we propose using LLMs to identify relevant section headers. We find that GPT-4 can effectively solve the task in both zero- and few-shot settings and segments dramatically better than state-of-the-art methods. Additionally, we annotate a much harder real-world dataset and find that GPT-4 struggles to perform well on it, pointing to the need for further research and harder benchmarks.
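
To make the zero- and few-shot setup concrete, the following is a minimal sketch of how a section-header prompt might be assembled and how a model's reply might be scored. It is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `parse_headers`, `precision_recall`), and the exact-match metric are all assumptions made for illustration, and the call to the LLM itself is deliberately left out.

```python
from typing import List, Optional, Sequence, Tuple


def build_prompt(note: str,
                 seed_headers: Optional[Sequence[str]] = None,
                 example: Optional[Tuple[str, Sequence[str]]] = None) -> str:
    """Assemble a zero-shot prompt, or a one-shot prompt when `example` is given.

    The instruction text is hypothetical; it is not the paper's template.
    """
    parts = ["List every section header that appears in the clinical note below, "
             "one per line, copied exactly as written."]
    if seed_headers:
        # Optional seed list of headers, e.g. drawn from an annotated corpus.
        parts.append("Headers often seen in similar notes: " + ", ".join(seed_headers))
    if example:
        ex_note, ex_headers = example
        parts.append("Example note:\n" + ex_note)
        parts.append("Example headers:\n" + "\n".join(ex_headers))
    parts.append("Note:\n" + note)
    parts.append("Headers:")
    return "\n\n".join(parts)


def parse_headers(response: str) -> List[str]:
    """Split a model reply into headers, dropping bullet markers and blank lines."""
    headers = []
    for line in response.splitlines():
        cleaned = line.strip().lstrip("-*• ").strip()
        if cleaned:
            headers.append(cleaned)
    return headers


def normalize(header: str) -> str:
    """Lowercase and drop a trailing colon so 'ALLERGIES:' matches 'Allergies'."""
    return header.rstrip(":").strip().lower()


def precision_recall(predicted: Sequence[str],
                     gold: Sequence[str]) -> Tuple[float, float]:
    """Exact-match precision and recall over normalized header sets (assumed metric)."""
    pred = {normalize(h) for h in predicted}
    ref = {normalize(h) for h in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall
```

Calling `build_prompt(note)` with no other arguments yields the zero-shot variant; passing `example` and `seed_headers` yields a one-shot variant. Exact matching is a strict choice that would penalize OCR-induced misspellings, so a fuzzy comparison may be more appropriate for noisy real-world notes.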

Paper Structure (17 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Sample real-world obscure image of an outpatient paper-based patient encounter form comprising numerous sections [hersh2018health].
  • Figure 2: Section categories which are selected based on observation of top-header sections in the corpus and human judgment to associate section names to their topic or category of representations.
  • Figure 3: Basic Prompt Template
  • Figure 4: One Shot Prompt: provide examples of segmentation as well as provide a seed list of headings found in MedSecId.
  • Figure 5: CoT Prompt: make the LLM think rationally and try to extract all possible section headers in the clinical notes
  • ...and 1 more figure