Combining Data Generation and Active Learning for Low-Resource Question Answering

Maximilian Kimmich, Andrea Bartezzaghi, Jasmina Bogojeska, Cristiano Malossi, Ngoc Thang Vu

TL;DR

This work tackles data scarcity in machine reading question answering (MRQA) by combining data augmentation via question-answer generation with Active Learning (AL) to build effective QA systems in low-resource, domain-diverse settings. The authors train a QA2S data-generation model with AL to create targeted synthetic data, then fine-tune an MRQA model on the generated data and a small set of real target-domain labels. The process is guided by a novel Round-trip (RT) scoring function that ties the downstream MRQA task back to data generation. Across the source domain SQuAD and the target domains TechQA and BioASQ (and Natural Questions), the approach yields consistent improvements over baselines that use either random labeling or AL on MRQA alone, with notable gains when AL is applied to data generation. These results point to a practical way to reduce labeling effort while still achieving robust extractive QA performance in specialized domains.

Abstract

Neural approaches have become very popular in Question Answering (QA); however, they require a large amount of annotated data. In this work, we propose a novel approach that combines data augmentation via question-answer generation with Active Learning to improve performance in low-resource settings, where the target domains are diverse in terms of difficulty and similarity to the source domain. We also investigate Active Learning for question answering at different stages, reducing the overall annotation effort required from humans. For this purpose, we consider target domains in realistic settings, with an extremely low number of annotated samples but with many unlabeled documents, which we assume can be obtained with little effort. Additionally, we assume that a sufficient amount of labeled data from the source domain is available. We perform extensive experiments to find the best setup for incorporating domain experts. Our findings show that our novel approach, where humans are incorporated into the data generation process, boosts performance in the low-resource, domain-specific setting, allowing for low-labeling-effort question answering systems in new, specialized domains. They further demonstrate how human annotation affects QA performance depending on the stage at which it is performed.

Paper Structure (31 sections, 4 equations, 3 figures, 4 tables, 1 algorithm)

Figures (3)

  • Figure 1: Our approach combining Active Learning with data generation: In a first step, the data generation model is efficiently trained using Active Learning. Second, this model is then used to generate data for MRQA.
  • Figure 2: Sample score distribution for TechQA: RT scores many samples low, but surprisingly also rates some samples high, although the task of predicting the generated answer for a generated question is complex. Scores have been rescaled to $\left[0,1\right]$ per scoring function and iteration to better compare distributions.
  • Figure 3: Visualization of the BioASQ dataset samples based on representations obtained from the MRQA model, with samples selected by RT in the last iteration (221 instances) marked in red. The selected samples are well distributed among all samples, suggesting that a diverse set of samples is selected.
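
The Figure 2 caption describes round-trip scoring as predicting the generated answer for a generated question, with scores rescaled to $\left[0,1\right]$ per scoring function and iteration. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a token-overlap F1 as the agreement measure and assumes that the lowest-scoring samples are the ones sent for annotation. The names `token_f1`, `round_trip_scores`, `rescale`, `select_for_annotation`, and the `answer_fn` callable (a stand-in for the MRQA model) are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (context, generated_question, generated_answer); this layout is an assumption.
Sample = Tuple[str, str, str]


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings (whitespace tokens, lowercased)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers agree; one empty and one non-empty do not.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def round_trip_scores(
    samples: List[Sample], answer_fn: Callable[[str, str], str]
) -> List[float]:
    """Score each generated QA pair by how well answer_fn recovers the generated answer."""
    return [token_f1(answer_fn(ctx, q), a) for ctx, q, a in samples]


def rescale(scores: List[float]) -> List[float]:
    """Min-max rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_for_annotation(scores: List[float], budget: int) -> List[int]:
    """Indices of the `budget` lowest-scoring samples (selection rule is an assumption)."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:budget]
```

In use, `answer_fn` would wrap whatever extractive QA model is available. Calling `rescale` once per scoring function and iteration mirrors the normalization mentioned in the caption, which makes score distributions comparable across iterations.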