Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision

Fan Jiang, Tom Drummond, Trevor Cohn

TL;DR

CLASS is a unified encoder–decoder framework for cross-lingual open-domain QA that learns both retrieval and answer generation in a single model. It uses a two-stage self-supervised pre-training regime. The first stage, cross-lingual retrieval pre-training (CLR-PT), distills from an English teacher using cloze-style parallel queries. The second stage, multilingual QA pre-training (MLQA-PT), uses anchor-text-derived data and LLM-assisted query transformation to produce natural questions. Asynchronous passage updates keep retrieval aligned during training. On the XOR-TyDi and MKQA benchmarks, CLASS achieves strong unsupervised and zero-shot performance, with competitive few-shot gains, and shows that end-to-end pre-training can surpass MT-augmented pipelines and separately trained retrievers and readers. The work highlights scalable cross-lingual transfer and suggests directions for future work: extending coverage to more languages, reducing training-resource demands, and mitigating biases inherited from LLMs.
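To make the distillation idea in CLR-PT more concrete, here is a minimal sketch of one common way to distill retrieval scores. It assumes the teacher scores a set of candidate passages for the English query, the student scores the same passages for the parallel query in another language, and the loss is the KL divergence between the two softmax distributions. The function names, the temperature parameter, and the choice of KL divergence are illustrative assumptions, not details confirmed by this summary.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw relevance scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over one shared list of candidate passages.

    Assumption: teacher_scores come from an English query and
    student_scores from its parallel query in another language.
    """
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same passages")
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical scores for three candidate passages.
teacher = [4.0, 1.0, 0.5]  # English query, teacher model
student = [2.0, 2.5, 0.5]  # parallel non-English query, student model
loss = distillation_loss(teacher, student)
```

The loss is zero when the student ranks passages with the same distribution as the teacher and grows as the two distributions diverge, which is what pushes the student's cross-lingual scores toward the teacher's monolingual ones in this sketch.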

Abstract

Cross-lingual open domain question answering (CLQA) is a complex problem, comprising cross-lingual retrieval from a multilingual knowledge base, followed by answer generation in the query language. Both steps are usually tackled by separate models, requiring substantial annotated datasets, and typically auxiliary resources, like machine translation systems to bridge between languages. In this paper, we show that CLQA can be addressed using a single encoder-decoder model. To effectively train this model, we propose a self-supervised method based on exploiting the cross-lingual link structure within Wikipedia. We demonstrate how linked Wikipedia pages can be used to synthesise supervisory signals for cross-lingual retrieval, through a form of cloze query, and generate more natural questions to supervise answer generation. Together, we show our approach, CLASS, outperforms comparable methods on both supervised and zero-shot language adaptation settings, including those using machine translation.
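As an illustration of how a cloze query might be derived from a hyperlinked sentence, the sketch below masks the anchor text of a link and keeps the linked page as the retrieval target. The `[MASK]` placeholder, the `ClozeExample` type, and the `make_cloze` helper are hypothetical names chosen for this example. The abstract does not specify the exact construction, so treat this as one plausible reading rather than the authors' procedure.

```python
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


@dataclass
class ClozeExample:
    query: str        # sentence with the anchor text masked out
    answer: str       # the masked anchor text
    target_page: str  # identifier of the page the anchor links to


def make_cloze(sentence: str, anchor_text: str, target_page: str,
               mask: str = MASK) -> ClozeExample:
    """Mask the first occurrence of anchor_text in sentence."""
    start = sentence.find(anchor_text)
    if start < 0:
        raise ValueError("anchor text not found in sentence")
    end = start + len(anchor_text)
    query = sentence[:start] + mask + sentence[end:]
    return ClozeExample(query=query, answer=anchor_text, target_page=target_page)


# Hypothetical sentence and link target.
example = make_cloze(
    "Canberra is the capital of Australia.",
    anchor_text="Australia",
    target_page="en:Australia",
)
```

Here `example.query` is the sentence with the anchor replaced by the mask, and `example.target_page` names the passage source a retriever would be trained to find. In a cross-lingual setting, the target page could be the linked article in a different language edition.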

Paper Structure

This paper contains 54 sections, 9 equations, 12 figures, and 14 tables.

Figures (12)

  • Figure 1: The overview of our two-stage unsupervised pre-training method for cross-lingual open domain question answering. English translations from Google Translate are added in (b) for readability.
  • Figure 2: The unified model for passage retrieval and question answering.
  • Figure 3: Our training pipeline. CLR: cross-lingual retrieval, MLQA: multilingual question answering, QT: query transformation, PT: pre-training, FT: fine-tuning.
  • Figure 4: Zero-shot cross-lingual retrieval and multilingual QA results on unseen languages of MKQA. Macro average results across all test languages are reported. Languages included are: Da, De, Es, Fr, He, Hu, It, Km, Ms, Nl, No, Pl, Pt, Sv, Th, Tr, Vi, Zh-cn, Zh-hk, and Zh-tw.
  • Figure 5: Ablations on cross-lingual retrieval pre-training, with results on the XOR-Retrieve dev set reported. $^{\ast}$ indicates unseen languages from MKQA.
  • ...and 7 more figures