Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

Douwe J. Spaanderman, Karthik Prathaban, Petr Zelina, Kaouther Mouheb, Lukáš Hejtmánek, Matthew Marzetti, Antonius W. Schurink, Damian Chan, Ruben Niemantsverdriet, Frederik Hartmann, Zhen Qian, Maarten G. J. Thomeer, Petr Holub, Farhan Akram, Frank J. Wolters, Meike W. Vernooij, Cornelis Verhoef, Esther E. Bron, Vít Nováček, Dirk J. Grünhagen, Wiro J. Niessen, Martijn P. A. Starmans, Stefan Klein

TL;DR

This study systematically benchmarks 15 open-weight LLMs for structured data extraction from narrative pathology and radiology reports across six clinical use cases and three languages, using six prompting strategies. It shows that small-to-medium general-purpose models can match larger models, especially with prompt graphs or few-shot prompts, and that task complexity and annotation quality largely govern performance. The results approach inter-rater agreement on several fields and suggest that language differences reflect annotation variability more than model limitations. The work offers a scalable, privacy-preserving framework for clinical data curation and promotes reproducible benchmarking of open-weight LLMs in healthcare.

Abstract

Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases (colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas) at three institutes in the Netherlands, UK, and Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation used to order models and linear mixed-effects models used to quantify sources of variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.
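The consensus rank aggregation mentioned in the abstract (identified in the figure captions as the Kemeny-Young method) can be illustrated with a minimal sketch. It assumes each use case yields one full ranking of the same models and finds, by brute force, the ordering with the smallest total Kendall tau distance to all of them. The model names and rankings are hypothetical placeholders, not results from the study, and the authors' actual implementation may differ.

```python
from itertools import combinations, permutations


def kendall_tau_distance(order_a, order_b):
    """Count item pairs that the two orderings place in opposite order."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kemeny_young(rankings):
    """Return the ordering minimising total Kendall tau distance to all rankings.

    Brute force over all permutations, so only feasible for a handful of items.
    """
    items = sorted(rankings[0])
    best = min(
        permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )
    return list(best)


# Hypothetical per-use-case rankings (best first); names are placeholders.
rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_a", "model_c", "model_b"],
    ["model_b", "model_a", "model_c"],
]
consensus = kemeny_young(rankings)
print(consensus)
```

The exhaustive search grows factorially with the number of items, so ranking many model and prompt combinations would in practice call for an integer-programming or heuristic solver rather than this enumeration.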

Paper Structure

This paper contains 65 sections, 9 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Study design. (A) Clinical use cases across different diseases, report languages, and extraction targets. (B) Evaluation pipeline: medical reports are processed through different prompting strategies (zero-shot, few-shot, chain-of-thought, self-consistency, and prompt graphs) defined by configuration files in YAML format (a human-readable format for structured data). Model outputs are converted into JSON (JavaScript Object Notation) files, a widely used format for structured data.
  • Figure 2: Performance of LLMs across clinical information extraction tasks. Results are shown for large, medium, small, tiny, and specialised models on six tasks in multiple languages, evaluated with different prompting strategies. Bars represent macro-average scores, with error bars indicating 95% confidence intervals from bootstrapping. For each model and task, zero-shot prompting (diagonal hatching) is shown alongside the best-performing strategy determined by Kemeny-Young rank aggregation, which is indicated with the corresponding marker. Where available, inter-rater agreement values are displayed to provide a human benchmark for model performance.
  • Figure 3: Kemeny-Young aggregated ranking of LLMs across clinical information extraction tasks. For each LLM, only the rank corresponding to its best-performing prompting strategy is reported. The final rank represents the consensus ordering obtained via the Kemeny-Young method across all evaluated use cases.
  • Figure 4: Model performance versus per-GPU throughput over all use cases. The y-axis shows the mean macro-average score across use cases, calculated using few-shot prompting only. Per-GPU throughput is defined as the number of tokens processed per second, normalised by the number of GPUs used to allow fair comparison across models. Bubble size represents model parameter size and colour indicates model category. Left: measured throughput; right: potential throughput per GPU corrected for cache utilisation (accounting for how much of the model’s cached key-value memory is effectively used). The sarcoma use case was excluded from this analysis because it was performed on different hardware.
  • Figure A1: Directed acyclic graph showing the sequential order of variable extraction for the colorectal liver metastases use case.
  • ...and 12 more figures
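
The Figure 2 caption reports macro-average scores with 95% confidence intervals from bootstrapping. A minimal sketch of one way to compute such an interval is given below, assuming a percentile bootstrap that resamples reports with replacement and using per-variable accuracy as a stand-in for the task-appropriate metrics. The resampling scheme, the metric, and the toy data are all assumptions for illustration, not the study's procedure or results.

```python
import random


def macro_average(rows):
    """Mean of per-variable accuracies; each row holds 0/1 flags for one report."""
    n_vars = len(rows[0])
    per_var = [sum(r[j] for r in rows) / len(rows) for j in range(n_vars)]
    return sum(per_var) / n_vars


def bootstrap_ci(rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling reports with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        macro_average([rng.choice(rows) for _ in rows]) for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical correctness flags: five reports, two extracted variables each.
rows = [[1, 1], [1, 0], [0, 1], [1, 1], [1, 0]]
point = macro_average(rows)
lower, upper = bootstrap_ci(rows)
print(f"macro-average {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Resampling whole reports rather than individual variables keeps the variables extracted from the same report together, which preserves any correlation between them in the interval estimate.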