MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Dan Saattrup Smart

TL;DR

MultiWikiQA is a new reading comprehension dataset covering 306 languages, with 1,220,757 samples in total. An evaluation of 6 different language models shows that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages.

Abstract

We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages and has 1,220,757 samples in total. We start with Wikipedia articles, which also provide the context for the dataset samples, and use an LLM to generate question/answer pairs related to each article, ensuring that the answer appears verbatim within the article. The question is then rephrased to hinder simple word-matching methods from performing well on the dataset. We conduct a crowdsourced human evaluation of the fluency of the generated questions, with 156 respondents across 30 of the languages (both low- and high-resource). All 30 languages received a mean fluency rating above "mostly natural", showing that the samples are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. Both the dataset and survey evaluations are publicly available.
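The two constraints described in the abstract (the answer must appear verbatim in the article, and the rephrased question should resist simple word matching) can be illustrated with a minimal sketch. This is not the authors' code: the `QASample` class and both helper functions are hypothetical names introduced here, and the whitespace-based word splitting is a simplifying assumption that would not suit languages written without spaces.

```python
from dataclasses import dataclass


@dataclass
class QASample:
    """Hypothetical container for one dataset sample (not the paper's schema)."""

    context: str   # Wikipedia article text
    question: str  # generated, and possibly rephrased, question
    answer: str    # answer span, expected to occur in the context


def answer_is_verbatim(sample: QASample) -> bool:
    """Return True if the non-empty answer occurs verbatim in the context."""
    answer = sample.answer.strip()
    return bool(answer) and answer in sample.context


def question_word_overlap(sample: QASample) -> float:
    """Fraction of distinct question words that also occur in the context.

    Assumption: words are separated by whitespace, which does not hold
    for every language.
    """
    question_words = {w.strip(".,?!").lower() for w in sample.question.split()}
    context_words = {w.strip(".,?!").lower() for w in sample.context.split()}
    question_words.discard("")
    if not question_words:
        return 0.0
    return len(question_words & context_words) / len(question_words)


context = "Copenhagen is the capital of Denmark."
original = QASample(context, "What is the capital of Denmark?", "Copenhagen")
rephrased = QASample(
    context, "Which city serves as Denmark's seat of government?", "Copenhagen"
)
```

In this toy example the rephrased question shares far fewer words with the context than the original one does, while the answer span is unchanged and still passes the verbatim check.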

Paper Structure

This paper contains 11 sections, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: MultiWikiQA dataset generation process.
  • Figure 2: The system and user prompt used to generate the tentative questions and answers.
  • Figure 3: The prompt used to rephrase the generated tentative questions.
  • Figure 4: The preamble used in all of the surveys.
  • Figure 5: Results from the conducted fluency surveys.
  • ...and 3 more figures