KazQAD: Kazakh Open-Domain Question Answering Dataset

Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, Pavel Braslavski

TL;DR

KazQAD addresses the need for robust QA resources in Kazakh by assembling an open-domain QA dataset that supports information retrieval (IR), reading comprehension (RC), and open-domain question answering (ODQA). It uses a hybrid data pipeline, with translated NQ questions for training and original Kazakh UNT questions for development and testing, anchored by a Kazakh Wikipedia corpus of more than 800,000 passages. The authors implement and evaluate baselines for retrieval, reading comprehension, and end-to-end QA, and include a pilot assessment of ChatGPT; the results leave clear room for improvement in Kazakh QA relative to English benchmarks. The dataset, tools, and baselines are publicly available under CC BY-SA, providing a resource for advancing NLP and IR for Kazakh and other low-resource languages.

Abstract

We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we believe there is still ample room for improvement. We also show that the current version of OpenAI's ChatGPT (v3.5) is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.
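
For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how exact match (EM), token-level F1, reciprocal rank (averaged over questions to obtain MRR), and NDCG@10 are commonly computed. It is not the authors' evaluation code: the whitespace-and-lowercase normalization, the linear-gain form of DCG, and all function names are assumptions made for illustration. The functions return values in [0, 1]; the EM and F1 figures in the abstract appear to be on a 0-100 scale.

```python
import math
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace tokenization. The paper's exact
    # answer normalization is not given in this summary.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    # Harmonic mean of token precision and recall (bag-of-tokens overlap).
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked_relevance: list[int]) -> float:
    # ranked_relevance[i] is the relevance label of the passage at rank i + 1.
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_relevance: list[int], all_relevance: list[int], k: int = 10) -> float:
    # Linear-gain DCG (an assumption; exponential gain is also widely used).
    # all_relevance holds the labels of every judged passage for the question,
    # which is needed to build the ideal ranking.
    def dcg(rels: list[int]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(all_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0
```

Dataset-level scores are the mean of these per-question values; when a question has several gold answers, a common convention is to take the maximum EM and F1 over them.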

Paper Structure (21 sections, 1 figure, 6 tables)

Figures (1)

  • Figure 1: KazQAD pipeline: Q -- question, P -- passage, S -- sentence, A -- answer; subscripts: en -- English, kk -- Kazakh. The upper part corresponds to the training set, while the lower part represents the development and test sets. In the case of the training set, we start with the pre-selected NQ questions and in-context answers machine-translated into Kazakh. Candidate passages are extracted from the parallel Kazakh Wikipedia pages. In the case of the development and test sets, the starting point is an original Kazakh question-answer pair from the UNT collection. Passages are retrieved from the Kazakh Wikipedia using Google search. In both annotation scenarios, the annotators are thus presented with question-paragraph pairs along with candidate answers.