GeMQuAD : Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning

Amani Namboori, Shivam Mangale, Andy Rosenbaum, Saleh Soltan

TL;DR

GeMQuAD tackles multilingual extractive QA data scarcity by using 1-shot in-context learning on AlexaTM 20B to generate synthetic Q&A pairs, followed by a WeakDAP-based semi-supervised filter that selects high-quality pairs for XLM-R-base fine-tuning. The method iterates up to $k=2$ rounds, progressively refining the silver data before combining it with a gold English dataset, all without fine-tuning the generator. Empirically, GeMQuAD improves Hindi and Spanish QA performance on MLQA and XQuAD relative to MT augmentation and English-only baselines, with substantial cross-lingual gains even for languages not present in the student’s fine-tuning data. These results demonstrate a cost-effective, data-efficient pathway for high-quality multilingual QA data generation that can extend to additional languages and domains, including future work on abstractive QA.
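
The filter-and-fine-tune loop summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `QARecord` type, the `toy_train` trainer, and the agreement rule (keep a silver record when the student's predicted answer matches the generated answer) are all assumptions made for the example, and the real student would be XLM-R-base rather than a lookup table.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class QARecord:
    context: str
    question: str
    answer: str


# A student is modelled as a function (context, question) -> predicted answer.
Student = Callable[[str, str], str]
# A trainer takes training records and returns a freshly tuned student.
Trainer = Callable[[List[QARecord]], Student]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def filter_silver(student: Student, silver: List[QARecord]) -> List[QARecord]:
    """Keep silver records whose generated answer the student reproduces.

    Assumption: agreement between student and generator is the quality signal.
    """
    return [
        r for r in silver
        if normalize(student(r.context, r.question)) == normalize(r.answer)
    ]


def iterative_filter_and_tune(
    train: Trainer, gold: List[QARecord], silver: List[QARecord], k: int = 2
) -> Tuple[Student, List[QARecord]]:
    """Run k rounds of: filter the silver data, then retrain on gold + kept."""
    student = train(gold)  # round 0: gold English data only
    kept: List[QARecord] = []
    for _ in range(k):
        kept = filter_silver(student, silver)
        student = train(gold + kept)
    return student, kept


def toy_train(records: List[QARecord]) -> Student:
    """Hypothetical stand-in for fine-tuning: memorize seen pairs,
    otherwise guess the first word of the context."""
    memory = {(r.context, r.question): r.answer for r in records}

    def student(context: str, question: str) -> str:
        return memory.get((context, question), context.split()[0])

    return student
```

The generator is never updated in this loop; only the student and the selected subset of silver data change between rounds, which mirrors the "no fine-tuning of the generator" point above.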

Abstract

The emergence of Large Language Models (LLMs) with capabilities like In-Context Learning (ICL) has ushered in new possibilities for data generation across various domains while minimizing the need for extensive data collection and modeling techniques. Researchers have explored ways to use this generated synthetic data to optimize smaller student models for reduced deployment costs and lower latency in downstream tasks. However, ICL-generated data often suffers from low quality because the few examples used in ICL provide only limited task specificity. In this paper, we propose GeMQuAD, a semi-supervised learning approach that extends the WeakDAP framework and is applied to a dataset generated through ICL with just one example in the target language using the AlexaTM 20B Seq2Seq LLM. Through our approach, we iteratively identify high-quality data to enhance model performance, especially in low-resource multilingual settings for the Extractive Question Answering task. Our framework outperforms the machine translation-augmented model by 0.22/1.68 F1/EM (Exact Match) points for Hindi and 0.82/1.37 F1/EM points for Spanish on the MLQA dataset, and it surpasses the performance of a model trained on an English-only dataset by 5.05/6.50 F1/EM points for Hindi and 3.81/3.69 F1/EM points for Spanish on the same dataset. Notably, our approach uses a pre-trained LLM for generation with no fine-tuning (FT), utilizing just a single annotated example in ICL to generate data, providing a cost-effective development process.
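
For readers unfamiliar with the F1/EM figures quoted above, the following is a simplified sketch of how the two scores are typically computed for a single predicted answer span in extractive QA. It assumes whitespace tokenization and lowercasing only; the official SQuAD and MLQA evaluation scripts apply additional, partly language-specific normalization (for example punctuation and article stripping), so this is not a drop-in replacement for them.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Simplified normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM only rewards a prediction that matches the reference span exactly, while F1 gives partial credit for overlapping tokens, which is why the two numbers are reported side by side.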

Paper Structure (13 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: 1-shot example prompts used to generate Spanish and Hindi synthetic data, respectively, with the AlexaTM 20B model using ICL. Prompt instructions are written in English and the data is in the target language. For readability, the instruction part is bolded and italicized. An example synthetic Q&A pair generated by the model is presented under the Result header.
  • Figure 2: The semi-supervised fine-tuning approach for the student, using data generated by the teacher. Data generation is a one-time step in which an LLM (AlexaTM) is used to create synthetic data through few-shot learning. The data is then passed to the Data Filtering & Model Tuning stage, which iteratively filters high-quality records and improves the labeling model by fine-tuning the student (XLM-R-base) until optimal performance is achieved, based on predetermined stopping criteria.
  • Figure 3: Number of records in each dataset and the purpose of each dataset. For MLQA and XQuAD, the number shown is the average across languages.
  • Figure 4: The number of synthetic data samples classified as correct in each iteration.
  • Figure 5: The F1 score on the SQuAD validation split after each FT step. The horizontal green dashed lines mark the performance of the best (third) round at each step.
  • ...and 1 more figures