A Question Answering Based Pipeline for Comprehensive Chinese EHR Information Extraction

Huaiyuan Ying, Sheng Yu

TL;DR

A novel approach automatically generates training data for transfer learning of QA models. The resulting model exhibits excellent performance on subtasks of information extraction in EHRs and can effectively handle few-shot or zero-shot settings involving yes-no questions.

Abstract

Electronic health records (EHRs) hold significant value for research and applications. As a new way of information extraction, question answering (QA) can extract more flexible information than conventional methods and is more accessible to clinical researchers, but its progress is impeded by the scarcity of annotated data. In this paper, we propose a novel approach that automatically generates training data for transfer learning of QA models. Our pipeline incorporates a preprocessing module to handle challenges posed by extraction types that are not readily compatible with extractive QA frameworks, including cases with discontinuous answers and many-to-one relationships. The obtained QA model exhibits excellent performance on subtasks of information extraction in EHRs, and it can effectively handle few-shot or zero-shot settings involving yes-no questions. Case studies and ablation studies demonstrate the necessity of each component in our design, and the resulting model is deemed suitable for practical use.

Paper Structure (22 sections, 4 equations, 3 figures, 6 tables)

This paper contains 22 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The architecture of our pipeline. From the EHR corpus, we obtain the original dependency annotations and the context. The colored words are the annotated general entities, with each color standing for one type, and the arrows represent the dependency annotations. During preprocessing, the dependency annotations are transformed into questions through manually constructed templates based on relation types. The contexts are split according to the many-to-one correspondences of the relation pairs, resulting in sentence-level or paragraph-level texts. The questions and the texts are concatenated and fed into the QA model for training. We also introduce impossible questions with plausible answers through annotations of the same type. The QA model judges the answerability of each question-context pair and outputs the answer span. Finally, the answers from the split texts are merged to provide the final outputs.
  • Figure 2: The Retro-reader model contains two reading modules and a rear verification module. The scores produced by the reading modules are compared to a threshold to decide whether the question is answerable.
  • Figure 3: Examples of translated gold annotations and the predictions of different models. The colored words in the context are the gold answers, and the red word marks where the boundary mismatches. Note that translation may omit some expressions in the Chinese original, so the examples are for reference only and may not reflect the whole picture of Chinese EHR texts.
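
The preprocessing and answerability steps described in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The relation type name, the English template, the `Relation` and `QAExample` classes, the `build_examples` and `is_answerable` functions, and the weighted-sum rule with its default weights and threshold are all assumptions made for illustration. The sketch turns relation annotations into extractive QA examples through a template, adds an impossible question whose plausible answer is borrowed from an annotation of the same type, and compares a combined no-answer score to a threshold.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical relation-type -> question template (English for readability;
# the paper's templates are manually constructed for Chinese EHR text).
TEMPLATES = {"dose_of_drug": "What is the dose of {head}?"}


@dataclass
class Relation:
    rel_type: str    # key into TEMPLATES
    head: str        # entity the question asks about
    tail: str        # answer text
    tail_start: int  # character offset of the answer in the context


@dataclass
class QAExample:
    question: str
    context: str
    answer: Optional[str]          # None marks an impossible question
    answer_start: Optional[int]
    plausible_answer: Optional[str] = None


def build_examples(context, relations, candidate_heads):
    """Build answerable and impossible QA examples for one context.

    candidate_heads is a list of (rel_type, head) pairs that could be asked
    about; pairs with no annotated relation become impossible questions.
    """
    examples, answered, tails_by_type = [], set(), {}
    for r in relations:
        # Check that the annotated offset really points at the answer text.
        if context[r.tail_start:r.tail_start + len(r.tail)] != r.tail:
            raise ValueError(f"offset mismatch for answer {r.tail!r}")
        question = TEMPLATES[r.rel_type].format(head=r.head)
        examples.append(QAExample(question, context, r.tail, r.tail_start))
        answered.add((r.rel_type, r.head))
        tails_by_type.setdefault(r.rel_type, []).append(r.tail)
    for rel_type, head in candidate_heads:
        if (rel_type, head) in answered:
            continue
        # Plausible answer: an annotated answer of the same relation type.
        same_type = tails_by_type.get(rel_type, [])
        plausible = same_type[0] if same_type else None
        question = TEMPLATES[rel_type].format(head=head)
        examples.append(QAExample(question, context, None, None, plausible))
    return examples


def is_answerable(null_score_a, null_score_b, threshold=0.0, w_a=0.5, w_b=0.5):
    """Assumed decision rule: each input is a no-answer score (higher means
    more likely unanswerable); answerable if the weighted sum does not
    exceed the threshold."""
    return w_a * null_score_a + w_b * null_score_b <= threshold


context = "Aspirin 100 mg daily; metoprolol continued."
relations = [Relation("dose_of_drug", "Aspirin", "100 mg", context.index("100 mg"))]
heads = [("dose_of_drug", "Aspirin"), ("dose_of_drug", "metoprolol")]
examples = build_examples(context, relations, heads)
```

Here the second example asks about metoprolol, for which no dose is annotated, so it carries no answer but keeps "100 mg" as a plausible distractor. Splitting contexts for many-to-one relations and merging answers from the split texts are left out of the sketch.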