emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information

Jimenez Eladio; Hao Wu

TL;DR

The paper addresses the challenge of medical question answering by introducing emrQA-msquad, a dataset that casts EMR-derived medical content into the SQuAD v2.0 span-extraction framework to improve QA performance in the medical domain. It builds on the emrQA resource and the SQuAD v2.0 benchmark, and details a design pipeline that restructures unstructured EMR reports and combines large-language-model-based summarization with manual ground-truth curation. The resulting dataset contains 253 contexts, 163,695 questions, and 4,136 answers. The authors show substantial gains from domain-specific fine-tuning of BERT, RoBERTa, and Tiny RoBERTa over their SQuAD-only baselines, with higher Exact Match and F1 scores and a shift of the prediction distributions toward higher F1 values. The emrQA-msquad resource is publicly available on HuggingFace as a structured medical QA benchmark for future research and practical clinical QA applications.

Abstract

Machine Reading Comprehension (MRC) plays a pivotal role in Medical Question Answering Systems (QAS) and in how medical information is accessed and applied. However, challenges inherent to the medical field, such as complex terminology and question ambiguity, call for new solutions. One key solution is to integrate specialized medical datasets and to create dedicated ones. This approach improves the accuracy of QAS, contributing to advances in clinical decision-making and medical research. To address the intricacies of medical terminology, a specialized dataset was integrated: a novel span-extraction dataset derived from emrQA and restructured into 163,695 questions and 4,136 manually obtained answers, named emrQA-msquad. Additionally, for ambiguous questions, a dedicated medical dataset for the span-extraction task was introduced, reinforcing the system's robustness. Fine-tuning BERT, RoBERTa, and Tiny RoBERTa for medical contexts significantly improved response accuracy: the share of responses with an F1 score between 0.75 and 1.00 rose from 10.1% to 37.4%, from 18.7% to 44.7%, and from 16.0% to 46.8%, respectively. The emrQA-msquad dataset is publicly available at https://huggingface.co/datasets/Eladio/emrqa-msquad.
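The abstract reports results as the share of responses whose F1 score falls between 0.75 and 1.00. The text here does not give the scoring code, so the following is only a minimal sketch that assumes the widely used SQuAD-style answer normalization (lowercasing, removing punctuation and English articles) and token-overlap F1. The function names `normalize`, `exact_match`, `token_f1`, and `share_in_range` are hypothetical and chosen for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Token-overlap F1 between a predicted span and a reference span."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Assumed SQuAD v2.0 convention: an empty (no-answer) prediction
        # only matches an empty reference.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def share_in_range(scores: list[float], low: float = 0.75, high: float = 1.00) -> float:
    """Fraction of scores that fall in the closed interval [low, high]."""
    if not scores:
        return 0.0
    return sum(low <= s <= high for s in scores) / len(scores)
```

Under these assumptions, applying `token_f1` to every (prediction, reference) pair of a model and passing the resulting list to `share_in_range` would give a figure comparable in kind to the percentages quoted above, although the authors' exact procedure may differ.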
