Few-Shot Multilingual Open-Domain QA from 5 Examples

Fan Jiang, Tom Drummond, Trevor Cohn

TL;DR

FsModQA tackles multilingual open-domain QA under annotation scarcity by combining a WikiData-based self-supervised pretraining stage with a synthetic data generation pipeline that uses few-shot prompts from large language models. The approach yields a unified retrieval-and-generation model trained on 18.7M MlWikiQA triples and 1.7M FsMlQA synthetic QAs, enabling strong performance in both cross-lingual retrieval and multilingual QA, including zero-shot adaptation to unseen languages. Ablation and scaling studies show pretraining, cross-lingual data, and careful data filtering are crucial, while zero-shot prompting and English-prompting strategies offer practical language-adaptation pathways without costly annotation. Overall, FsModQA significantly narrows the gap to supervised multilingual baselines and demonstrates effective, data-efficient language adaptation with practical implications for deploying open-domain QA in underrepresented languages.

Abstract

Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a few-shot learning approach to synthesise large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval. We further show our method can be extended for effective zero-shot adaptation to new languages through a cross-lingual prompting strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.

Paper Structure

This paper contains 54 sections, 2 equations, 10 figures, 21 tables.

Figures (10)

  • Figure 1: Left (a): The process of multilingual open-domain QA. Middle (b): training strategies: 1) self-supervised pre-training; 2) fine-tuning on English QA; 3) translate English QA to target languages; 4) use English data to prompt LLMs to generate target language data; 5) use few-shot in-language data for LLM prompting. Right (c): Performance comparison (Avg. F1) on the XOR-Full dataset.
  • Figure 2: Full pipeline for data construction and model training: (1) generate large-scale data from Wikidata for self-supervised pre-training; (2) use few-shot prompting to generate synthetic Q&A pairs from Wikipedia passages of target languages, on which the pre-trained model is further fine-tuned.
  • Figure 3: Pre-training data construction pipeline: (1) transform WikiData triples into QAs using LLMs for each target language $L$, and (2) identify in-language and cross-lingual positive passages from the head entity's Wikipedia page and through language links. English translations are added for readability.
  • Figure 4: The unified model for passage retrieval and question answering.
  • Figure 5: Performance when trained with different sizes of our synthetic data.
  • ...and 5 more figures