MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering

Shayne Longpre, Yi Lu, Joachim Daiber

TL;DR

MKQA addresses the lack of linguistically diverse, realistic open-domain QA evaluation by providing 10k English questions translated into 26 languages with retrieval-independent, Wikidata-grounded answers, enabling fair cross-language comparison. The dataset emphasizes parallel questions, geographic invariance, and broad typological diversity to minimize translation artifacts and support multiple QA paradigms. The paper details a six-stage data collection pipeline, analyzes quality and translation reliability, and benchmarks a suite of baselines (retrieval-based, translation-based, and generative models) showing MKQA's increased difficulty over English-only datasets. The work offers a practical benchmark for evaluating multilingual QA systems and highlights directions for future cross-lingual QA research.

Abstract

Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). Answers are based on a heavily curated, language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to date for evaluating question answering. We benchmark a variety of state-of-the-art methods and baselines for generative and extractive question answering, trained on Natural Questions, in zero-shot and translation settings. Results indicate this dataset is challenging even in English, but especially in low-resource languages.

Paper Structure

This paper contains 38 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Data Collection Process. A depiction of the 6 sequential steps in our data collection pipeline. The first four steps involve Answer Curation, and the last two localize questions and answers into 26 target languages.
  • Figure 2: Answer Type Breakdown. Compares the distribution of answer types between MKQA and Natural Questions (NQ) for the 10k examples in the evaluation set.
  • Figure 3: F1 by language. XLM-R zero-shot performance ranked by language. Unanswerable F1 (in red) corresponds to the proportion of the Aggregate F1 obtained from predicting No Answer. The Unanswerable proportion is calculated as the percentage of unanswerable examples ($32.42\%$) multiplied by the Unanswerable F1.
  • Figure 4: Comparing MKQA and NQ English annotations. The performance of the same English BERT-Large model on the Natural Questions (NQ) annotations and on the MKQA annotations, using the MKQA evaluation metrics. In all plots, the y-axis is the F1 score and the x-axis is the threshold applied to the No Answer probabilities. F1 by Answer Type (left) compares the model's F1 on Answerable and Unanswerable examples for each dataset, showing that Unanswerable examples are on average easier in MKQA and Answerable examples are on average harder in MKQA. NQ F1 Proportions (middle) and MKQA F1 Proportions (right) show what proportion of the aggregate F1 score is derived from each Answer Type. These plots demonstrate that MKQA is more difficult than NQ because it has a higher proportion of answerable questions, which are harder on average.
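
The decomposition described in the Figure 3 caption can be illustrated with a small sketch. Only the unanswerable share (32.42%) and the rule "unanswerable fraction multiplied by Unanswerable F1" come from the caption. The function name, the example F1 values, and the assumption that the aggregate F1 is a plain mean of per-example scores (so the answerable contribution is the remaining fraction times the Answerable F1) are illustrative and not taken from the paper.

```python
# Share of unanswerable examples, as stated in the Figure 3 caption.
UNANSWERABLE_FRACTION = 0.3242


def decompose_aggregate_f1(answerable_f1, unanswerable_f1,
                           unanswerable_fraction=UNANSWERABLE_FRACTION):
    """Split an aggregate F1 into per-answer-type contributions.

    Assumption (not from the paper): the aggregate F1 is the mean of
    per-example F1 scores, so each answer type contributes its share of
    the examples times its own mean F1.
    """
    # Rule from the caption: unanswerable share times Unanswerable F1.
    unanswerable_part = unanswerable_fraction * unanswerable_f1
    # Assumed complement: the remaining examples are answerable.
    answerable_part = (1.0 - unanswerable_fraction) * answerable_f1
    return {
        "unanswerable": unanswerable_part,
        "answerable": answerable_part,
        "aggregate": unanswerable_part + answerable_part,
    }


if __name__ == "__main__":
    # Hypothetical F1 values, chosen only to show the arithmetic.
    parts = decompose_aggregate_f1(answerable_f1=0.5, unanswerable_f1=0.8)
    for name, value in parts.items():
        print(f"{name}: {value:.4f}")
```

Under this reading, a model that scores well on No Answer predictions can reach a sizeable aggregate F1 while still performing poorly on answerable questions, which is why the figures report the two contributions separately.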