CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports
Xiao Yu Cindy Zhang, Carlos R. Ferreira, Francis Rossignol, Raymond T. Ng, Wyeth Wasserman, Jian Zhu
TL;DR
CaseReportBench is a publicly available, expert-annotated benchmark for dense information extraction from clinical case reports, focused on Inborn Errors of Metabolism (IEM). The study benchmarks five LLMs using three data integration methods (FCSP, UCP, UGP) and three prompting strategies: zero-shot (ZS), few-shot (FS), and zero-shot chain-of-thought (ZS-CoT). It finds that category-specific prompting improves alignment with the benchmark and that open-source models can surpass GPT-4o on this task. FCSP offers a favorable balance of accuracy and efficiency, and larger model size does not guarantee better performance because of alignment and instruction-fidelity concerns. Clinician evaluations show promise for LLM-assisted extraction to support diagnosis and workflow, with an approximate 24-hour reduction in manual annotation time across 138 cases, while underscoring the need for expert oversight. Overall, CaseReportBench advances clinical NLP by enabling scalable, structured extraction from non-EHR clinical narratives and by laying a foundation for expansion to other disease domains.
Abstract
Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-annotated dataset for dense information extraction from case reports, focusing on IEMs. Using this dataset, we assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought prompting offers little advantage over standard zero-shot prompting. Category-specific prompting improves alignment with the benchmark. The open-source model Qwen2.5-7B outperforms GPT-4o on this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings that are important for differential diagnosis. This work advances LLM-driven clinical natural language processing and paves the way for scalable medical AI applications.
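To make the idea of category-specific prompting concrete, the following is a minimal sketch of how one zero-shot prompt per extraction category might be assembled, as opposed to a single prompt that asks for all categories at once. It is illustrative only: the category names, the prompt wording, and the helper `build_category_prompts` are hypothetical and are not taken from the benchmark or the authors' code.

```python
from typing import Dict, List

# Hypothetical category names, used only for illustration;
# the benchmark's actual schema is not reproduced here.
EXAMPLE_CATEGORIES: List[str] = [
    "neurological findings",
    "laboratory results",
    "family history",
]


def build_category_prompts(case_text: str, categories: List[str]) -> Dict[str, str]:
    """Return one zero-shot prompt per category (assumed wording)."""
    prompts: Dict[str, str] = {}
    for category in categories:
        prompts[category] = (
            f"From the case report below, extract every statement about "
            f"'{category}'. Include explicitly negated findings, and answer "
            f"'None' if the report says nothing on this category.\n\n"
            f"Case report:\n{case_text}"
        )
    return prompts


if __name__ == "__main__":
    demo_text = "A 3-year-old presented with lethargy. No seizures were observed."
    for name, prompt in build_category_prompts(demo_text, EXAMPLE_CATEGORIES).items():
        print(f"--- {name} ---")
        print(prompt)
```

Each prompt would then be sent to the model separately and the answers collected into one structured record per case. Asking explicitly for negated findings in the sketch reflects the limitation noted above; whether such wording helps is not something this example establishes.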
