MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering

Xiusi Chen, Jyun-Yu Jiang, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Wei Wang

TL;DR

MinPrompt tackles data-efficiency in few-shot open-domain QA by constructing a sentence graph from raw text, then applying a greedy $(\ln \Delta + 2)$-approximation to derive a minimal dominating set of informative sentences. These sentences are transformed into QA pairs through unsupervised question generation and turned into prompt-style training data, which are combined with original data to train an encoder-decoder model under a weighted loss $L(\theta)=L^{ori}(\theta)+\lambda L^{aug}(\theta)$. Across eight MRQA benchmarks in 16–128-shot settings, MinPrompt delivers comparable or superior F1 with reduced variance, demonstrating the value of principled data selection and graph-based representations for data-efficient QA. The approach emphasizes structured text understanding and targeted augmentation to reduce annotation burden while maintaining or improving performance in open-domain QA scenarios.
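The weighted objective $L(\theta)=L^{ori}(\theta)+\lambda L^{aug}(\theta)$ can be illustrated with a small numeric sketch. The function name `combined_loss`, the use of a simple mean over per-example losses, and the toy values are assumptions made for illustration; the summary does not specify how each term is reduced over a batch.

```python
def combined_loss(ori_losses, aug_losses, lam):
    """Return L_ori + lam * L_aug, using the mean per-example loss for each term.

    Averaging within each term is an assumption of this sketch.
    """
    loss_ori = sum(ori_losses) / len(ori_losses)
    loss_aug = sum(aug_losses) / len(aug_losses)
    return loss_ori + lam * loss_aug


# Hypothetical per-example losses on original and augmented data.
total = combined_loss([1.0, 3.0], [0.5, 1.5], lam=0.5)
print(total)
```

Setting `lam` to 0 recovers training on the original data only, while larger values give the augmented examples more influence.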

Abstract

Recent advances in few-shot question answering (QA) mostly rely on the power of pre-trained large language models (LLMs) and fine-tuning in specific settings. Although the pre-training stage has already equipped LLMs with powerful reasoning capabilities, LLMs still need to be fine-tuned to adapt to specific domains to achieve the best results. In this paper, we propose to select the most informative data for fine-tuning, thereby improving the efficiency of the fine-tuning process with comparable or even better accuracy on the open-domain QA task. We present MinPrompt, a minimal data augmentation framework for open-domain QA based on an approximate graph algorithm and unsupervised question generation. We transform the raw text into a graph structure to build connections between different factual sentences, then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text. We then generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model. Empirical results on several benchmark datasets and theoretical analysis show that MinPrompt is able to achieve comparable or better results than baselines with a high degree of efficiency, bringing consistent improvements in F1 scores.
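The graph-construction step can be sketched as follows, using the example from the Figure 2 caption (nodes are sentences, edges link sentences that mention the same entity). This is a minimal sketch, not the authors' implementation: the function name `build_sentence_graph` is hypothetical, and the entities per sentence are assumed to be already extracted by some upstream step.

```python
from collections import defaultdict
from itertools import combinations


def build_sentence_graph(sentence_entities):
    """Return an adjacency map: sentence index -> indices of sentences sharing an entity."""
    # Group sentence indices by the entities they mention.
    by_entity = defaultdict(set)
    for idx, entities in enumerate(sentence_entities):
        for entity in entities:
            by_entity[entity].add(idx)

    # Connect every pair of sentences that mention the same entity.
    graph = {idx: set() for idx in range(len(sentence_entities))}
    for members in by_entity.values():
        for u, v in combinations(sorted(members), 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Toy input mirroring the Figure 2 caption, with 0-indexed sentences:
# the first three mention Lakers; the third and fourth mention Crypto.com Arena.
toy_entities = [
    {"Lakers"},
    {"Lakers"},
    {"Lakers", "Crypto.com Arena"},
    {"Crypto.com Arena"},
]
toy_graph = build_sentence_graph(toy_entities)
print(toy_graph)
```

In this toy graph the third sentence is adjacent to all the others, which is the kind of sentence a dominating-set selection would favor.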

Paper Structure

This paper contains 24 sections, 1 theorem, 7 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Algorithm 1 computes an $(\ln \Delta + 2)$-approximation of the optimal dominating set. In other words, for the computed dominating set $S$ and an optimal dominating set $S^*$, we have $|S| \le (\ln \Delta + 2) \cdot |S^*|$, where $\Delta=\max_v d(v)$ is the maximal degree of $G$.
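A minimal sketch of a greedy dominating-set heuristic of the kind the theorem refers to is given below. It repeatedly picks the node whose closed neighborhood covers the most still-uncovered nodes. The function name, the tie-breaking rule (smallest node id), and the toy graphs are assumptions of this sketch; the paper's algorithm may differ in such details.

```python
def greedy_dominating_set(graph):
    """Greedily select nodes until every node is selected or adjacent to a selected node.

    `graph` maps each node to the set of its neighbors (undirected).
    """
    uncovered = set(graph)
    chosen = []
    while uncovered:
        # Span of v = number of uncovered nodes in its closed neighborhood.
        # max() over a sorted list keeps the smallest id among ties.
        best = max(sorted(graph), key=lambda v: len(({v} | graph[v]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | graph[best]
    return chosen


# Path graph 0-1-2-3-4 (hypothetical example).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
selected = greedy_dominating_set(path)
print(selected)
```

Each iteration covers at least one new node, because any uncovered node covers itself, so the loop always terminates.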

Figures (4)

  • Figure 1: Framework overview for MinPrompt.
  • Figure 2: Illustration of the sentence graph. In the sentence graph, nodes correspond to sentences and edges represent the coreference of entities across sentences. Sentences 1, 2 and 3 share the entity Lakers, while sentence 4 shares the entity Crypto.com Arena with sentence 3.
  • Figure 3: Examples of generated questions. When MinPrompt runs into an $entity$ in the raw text during the question generation phase, it turns the factual sentence into a QA pair of $(question, entity)$, with the question type depending on the entity type.
  • Figure 4: Case study. In both cases, MinPrompt successfully generates the correct answer, whereas baselines without entity masking cannot accurately recover the entity-level details.
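The question-generation step described in the Figure 3 caption, where a factual sentence becomes a $(question, entity)$ pair and the question type depends on the entity type, can be sketched roughly as below. The entity-type labels, the wh-word mapping, the simple string substitution, and the example sentence are all assumptions made for illustration; the paper's actual templates are not given in this summary.

```python
# Hypothetical mapping from entity type to question word.
WH_WORD_BY_TYPE = {"PERSON": "who", "LOCATION": "where", "DATE": "when"}


def make_qa_pair(sentence, entity, entity_type):
    """Turn a factual sentence into a (question, answer) pair by masking one entity."""
    if entity not in sentence:
        raise ValueError("entity must occur in the sentence")
    # Unknown entity types fall back to a generic "what" question.
    wh_word = WH_WORD_BY_TYPE.get(entity_type, "what")
    # Replace the first mention of the entity and turn the sentence into a question.
    body = sentence.replace(entity, wh_word, 1).rstrip(".")
    question = body[0].upper() + body[1:] + "?"
    return question, entity


# Illustrative sentence; it does not come from the paper.
question, answer = make_qa_pair(
    "The Lakers play at Crypto.com Arena.", "Crypto.com Arena", "LOCATION"
)
print(question, "->", answer)
```

The answer is always the masked entity itself, so no human annotation is needed to label the generated pair.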

Theorems & Definitions (2)

  • Theorem 1
  • proof