emrQA: A Large Corpus for Question Answering on Electronic Medical Records

Anusri Pampari, Preethi Raghavan, Jennifer Liang, Jian Peng

TL;DR

This work addresses the scarcity of large-scale QA data for electronic medical records (EMRs) by introducing a framework that re-purposes existing i2b2 annotations to generate a large EMR QA corpus (emrQA) with over 400,000 question-answer evidence pairs and 1 million question-logical form pairs. The method creates domain-specific question templates grounded in medical ontologies and links them to executable logical forms, supporting interpretable reasoning tasks, including temporal and arithmetic reasoning. Analysis of the corpus shows high paraphrase diversity, long and complex evidence spread across longitudinal notes, and substantial reasoning demands; baseline question-to-logical-form (Q-L) and question-to-answer (Q-A) models illustrate current gaps and the need for hybrid, interpretable approaches. The framework is designed to apply beyond EMRs and can be extended to other domains and datasets (e.g., MIMIC, DBPedia), which could increase the scale and interpretability of domain-specific QA research.

Abstract

We propose a novel methodology to generate domain-specific large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks. We demonstrate an instance of this methodology in generating a large-scale QA dataset for electronic medical records by leveraging existing expert annotations on clinical notes for various NLP tasks from the community shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question to logical form and question to answer mapping.

Paper Structure

This paper contains 20 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Question-Answer pairs from emrQA clinical note.
  • Figure 2: Our QA dataset generation framework using existing i2b2 annotations on a given patient's record to generate a question, its logical form and answer evidence. The highlights in the figure show the annotations being used for this example.
  • Figure 3: Events, attributes & relations in emrQA's logical forms. Events & attributes accept i2b2 entities as arguments.
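
The generation idea summarized in Figure 2 (filling a question template and its paired logical form from an existing annotation, then reading the answer and evidence off that same annotation) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the `MedicationAnnotation` class, the template strings, the `generate_qa` helper, and the sample values are hypothetical and are not taken from the emrQA code or data.

```python
from dataclasses import dataclass


@dataclass
class MedicationAnnotation:
    """Hypothetical stand-in for an i2b2-style medication annotation."""
    medication: str      # annotated entity
    dosage: str          # annotated attribute, used as the answer
    evidence_line: str   # note text that the annotation came from


# Hypothetical template pair: a question template and its logical form.
# Both share the same entity placeholder, |medication|.
QUESTION_TEMPLATE = "What is the dosage of |medication|?"
LOGICAL_FORM_TEMPLATE = "MedicationEvent(|medication|) [dosage=x]"


def generate_qa(annotation: MedicationAnnotation) -> dict:
    """Fill the placeholder in both templates from a single annotation."""
    placeholder = "|medication|"
    return {
        "question": QUESTION_TEMPLATE.replace(placeholder, annotation.medication),
        "logical_form": LOGICAL_FORM_TEMPLATE.replace(placeholder, annotation.medication),
        "answer": annotation.dosage,
        "evidence": annotation.evidence_line,
    }


if __name__ == "__main__":
    # Made-up example values, not drawn from any real patient record.
    example = MedicationAnnotation(
        medication="aspirin",
        dosage="81 mg daily",
        evidence_line="Continue aspirin 81 mg daily.",
    )
    for key, value in generate_qa(example).items():
        print(f"{key}: {value}")
```

Because the question and the logical form are filled from the same placeholder, each annotation yields an aligned question, logical form, and answer evidence triple without additional manual labeling, which is the property the framework relies on to reach a large corpus size.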