
Prompting-based Synthetic Data Generation for Few-Shot Question Answering

Maximilian Schmidt, Andrea Bartezzaghi, Ngoc Thang Vu

TL;DR

The paper addresses data scarcity in extractive MRQA by proposing prompting-based synthetic data generation that leverages the linguistic knowledge of large language models. It introduces a two-stage pipeline: first, candidate answers are sampled from contexts via NER; then, questions conditioned on the context and answer are generated using a seq2seq LM modeled as $p(q|c,a_c)$. An MRQA model is subsequently trained on the synthetic data plus any available labels. The approach yields competitive or superior performance in few-shot settings across multiple datasets, with notable zero-shot capabilities and, in some cases, parity with full-data baselines, demonstrating strong domain generalization. A human study on NewsQA indicates that generated data can match the quality of labeled data when sufficient samples are available, supporting broader applicability and suggesting avenues for further refinement and domain expansion.

Abstract

Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.

Paper Structure (31 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Comparison of a) common approaches, e.g., Prompting, for MRQA and b) our approach adding synthetic task- and domain-specific data without the need of additional labeled data.
  • Figure 2: An example of our data generation pipeline: We first sample answer candidates (using NER) and then prompt a PLM to generate a question conditioned on context and answer (1). The generated question-answer pair is then used with the initial context to train an MRQA model (2). Afterwards, we perform additional training on labeled data, if available.
  • Figure 3: MRQA performance (F1) as a function of dataset sizes for the best performing approaches on the mean of all datasets in the few-shot MRQA benchmark.
  • Figure 4: For the NewsQA dataset, 100 question-answer pairs in each setting were quality-assessed by humans (question: "Is the question candidate correctly answered by the answer candidate?"). The settings are generated data taking 16 samples into account, generated data taking 128 samples into account, and labeled (gold) data.
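
The two-stage pipeline described for Figure 2 (sample answer candidates, then generate a question conditioned on context and answer) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the regex in `sample_answer_candidates` is a toy stand-in for a real NER model, the prompt template in `build_prompt` is hypothetical, and `question_generator` is a placeholder callable where a seq2seq LM sampling from $p(q|c,a_c)$ would be plugged in.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SyntheticExample:
    """One extractive QA training instance: context, question, answer span."""
    context: str
    question: str
    answer: str
    answer_start: int


def sample_answer_candidates(context: str) -> List[Tuple[str, int]]:
    """Toy stand-in for NER (assumption): capitalized word runs and 4-digit years."""
    pattern = r"(?:[A-Z][a-z]+(?: [A-Z][a-z]+)*)|\b\d{4}\b"
    return [(m.group(0), m.start()) for m in re.finditer(pattern, context)]


def build_prompt(context: str, answer: str) -> str:
    """Hypothetical prompt template; the paper's actual template may differ."""
    return f"context: {context} answer: {answer} question:"


def generate_examples(
    context: str,
    question_generator: Callable[[str], str],
) -> List[SyntheticExample]:
    """Stage 1: sample answers. Stage 2: generate one question per answer."""
    examples = []
    for answer, start in sample_answer_candidates(context):
        question = question_generator(build_prompt(context, answer))
        examples.append(SyntheticExample(context, question, answer, start))
    return examples


def placeholder_generator(prompt: str) -> str:
    """Placeholder for the language model; returns a fixed dummy question."""
    return "Which entity does the context mention?"


if __name__ == "__main__":
    demo_context = "Marie Curie won the Nobel Prize in 1903."
    for example in generate_examples(demo_context, placeholder_generator):
        print(example)
```

The resulting `SyntheticExample` records would then serve as training data for an MRQA model, with additional training on labeled data if any is available. Passing the generator in as a callable keeps the sketch independent of any particular model library.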