Diagnosing our datasets: How does my language model learn clinical information?

Furong Jia, David Sontag, Monica Agrawal

TL;DR

This study investigates how open-source LLMs acquire clinical information from public data, focusing on two facets: understanding of clinical jargon and propagation of unsupported medical claims. It introduces MedLingo, a targeted jargon benchmark, and uses the WIMBD framework to estimate how often jargon appears in pretraining data, connecting jargon learning to corpus composition. A key finding is that jargon common in real clinical notes is often rare in open corpora. The work also examines which online sources contribute clinical jargon and disputed claims, showing that while peer-reviewed sources are common, informal and commercial sources substantially shape model outputs, especially for contested claims. The findings point to a need for careful data curation, robust evaluation for misinformation, and inference-time safeguards for the safe deployment of medical LLMs.

Abstract

Large language models (LLMs) have performed well across various clinical natural language processing tasks, despite not being directly trained on electronic health record (EHR) data. In this work, we examine how popular open-source LLMs learn clinical information from large mined corpora through two crucial but understudied lenses: (1) their interpretation of clinical jargon, a foundational ability for understanding real-world clinical notes, and (2) their responses to unsupported medical claims. For both use cases, we investigate the frequency of relevant clinical information in their corresponding pretraining corpora, the relationship between pretraining data composition and model outputs, and the sources underlying this data. To isolate clinical jargon understanding, we evaluate LLMs on a new dataset, MedLingo. Unsurprisingly, we find that the frequency of clinical jargon mentions across major pretraining corpora correlates with model performance. However, jargon that appears frequently in clinical notes is often rare in pretraining corpora, revealing a mismatch between available data and real-world usage. Similarly, we find that a non-negligible portion of documents support disputed claims that can then be parroted by models. Finally, we classify and analyze the types of online sources in which clinical jargon and unsupported medical claims appear, with implications for future dataset composition.
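The frequency analysis described in the abstract can be illustrated with a small sketch. This is a toy stand-in, not the authors' pipeline and not the WIMBD API: the documents, the jargon-expansion pairs, the accuracy values, and the function names `cooccurrence_count`, `ranks`, and `spearman` are all hypothetical, and the choice of Spearman rank correlation is an assumption made for illustration. The sketch counts documents in which a jargon term and its expansion co-occur, then checks whether those counts track per-term accuracy.

```python
import re
from typing import List, Sequence


def cooccurrence_count(docs: List[str], jargon: str, expansion: str) -> int:
    """Count documents that mention both the jargon term and its expansion."""
    jargon_re = re.compile(r"\b" + re.escape(jargon) + r"\b", re.IGNORECASE)
    expansion_re = re.compile(r"\b" + re.escape(expansion) + r"\b", re.IGNORECASE)
    return sum(1 for d in docs if jargon_re.search(d) and expansion_re.search(d))


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (sx * sy)


# Hypothetical toy corpus standing in for a pretraining dataset.
docs = [
    "Take ibuprofen PRN (as needed) for pain.",
    "PRN means as needed.",
    "Dosing is BID, i.e. twice a day.",
    "The patient is NPO after midnight.",
]
# Hypothetical jargon-expansion pairs and made-up per-pair model accuracies.
pairs = [("prn", "as needed"), ("bid", "twice a day"), ("npo", "nothing by mouth")]
accuracy = [0.9, 0.6, 0.2]

counts = [cooccurrence_count(docs, j, e) for j, e in pairs]
rho = spearman(counts, accuracy)
print(counts, round(rho, 3))
```

At corpus scale, exact regex scans over every document would be replaced by an index-backed count, which is the role a tool such as WIMBD plays in the paper. A rank correlation is used here because co-occurrence counts in web-scale corpora tend to be heavily skewed.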

Paper Structure

This paper contains 44 sections, 9 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: An overview of our analysis: 1) benchmarking models on their knowledge of clinical jargon and debunked medical claims, 2) estimating the prevalence of clinical keywords in the pretraining corpora and examining its correlation with model performance, and 3) investigating the sources of clinical data in pretraining corpora, both for jargon and unsupported medical claims.
  • Figure 2: Example of the difference between language in clinical notes vs. benchmarks.
  • Figure 3: OLMo accuracy vs. Dolma estimated co-occurrence frequency on CASI dataset. Each dot shows a jargon-expansion pair.
  • Figure 4: Estimated frequency of jargon in the Dolma dataset vs. in MIMIC-IV Notes.
  • Figure 5: Source classification for CASI, MedLingo, and the documents supporting disputed medical claims.
  • ...and 8 more figures