Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

Douwe J. Spaanderman; Karthik Prathaban; Petr Zelina; Kaouther Mouheb; Lukáš Hejtmánek; Matthew Marzetti; Antonius W. Schurink; Damian Chan; Ruben Niemantsverdriet; Frederik Hartmann; Zhen Qian; Maarten G. J. Thomeer; Petr Holub; Farhan Akram; Frank J. Wolters; Meike W. Vernooij; Cornelis Verhoef; Esther E. Bron; Vít Nováček; Dirk J. Grünhagen; Wiro J. Niessen; Martijn P. A. Starmans; Stefan Klein

TL;DR

This study systematically benchmarks 15 open-weight LLMs for structured data extraction from narrative pathology and radiology reports across six clinical use cases and three languages, using six prompting strategies. It shows that small-to-medium general-purpose models can match larger models, especially with prompt-graph or few-shot prompting, and that task complexity and annotation quality largely govern performance. Results approach inter-rater agreement on several fields, and differences between languages reflect annotation variability more than model limitations. The work offers a scalable, privacy-preserving framework for clinical data curation and promotes reproducible benchmarking of open-weight LLMs in healthcare.

Abstract

Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases (colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas) at three institutes in the Netherlands, the UK, and the Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models used to quantify variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.
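
As an illustration of one of the prompting strategies named above, self-consistency is commonly implemented by sampling several answers for the same field and keeping the most frequent one. The sketch below is a minimal example under that assumption; it is not taken from the study's code, and the function names `normalise` and `self_consistency_vote`, as well as the example answers, are hypothetical.

```python
from collections import Counter


def normalise(answer: str) -> str:
    """Lower-case and collapse whitespace so trivially different answers match."""
    return " ".join(answer.strip().lower().split())


def self_consistency_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent normalised answer and its share of the votes.

    `answers` holds several independently sampled model outputs for one
    extracted field (hypothetical input; the sampling step is not shown).
    Ties are resolved in favour of the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalise(a) for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


if __name__ == "__main__":
    # Made-up samples for a single field, not data from the study.
    samples = ["Malignant", "malignant ", "benign"]
    label, agreement = self_consistency_vote(samples)
    print(label, round(agreement, 2))
```

The vote share returned alongside the label can serve as a rough agreement signal, for example to flag low-agreement fields for manual review; whether the study uses it that way is not stated in the abstract.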
