ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

Abdelrahman Abdallah, Mahmoud Kasem, Mahmoud Abdalla, Mohamed Mahmoud, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt

TL;DR

ArabicaQA tackles the scarcity of Arabic QA resources by introducing the first large-scale Arabic MRC and open-domain QA dataset, complemented by AraDPR, a dense passage retriever trained on Arabic Wikipedia. The work provides extensive benchmarking of diverse LLMs for Arabic QA and evaluates retrieval-augmented generation to assess how retrieved evidence enhances responses. Key contributions include the construction of a heterogeneous QA dataset with Elaborate and Concise answer types, a dedicated Arabic dense retriever, and a comprehensive evaluation of monolingual and multilingual models in Arabic QA. This resource is poised to advance Arabic NLP by improving QA accuracy, information retrieval, and the development of language technologies tailored to Arabic.

Abstract

In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions, marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research at https://github.com/DataScienceUIBK/ArabicaQA.

Paper Structure

This paper contains 29 sections, 3 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: An example from the ArabicaQA dataset illustrating a passage about the Battle of Chosin Reservoir during the Korean War, along with corresponding question-answer pairs.
  • Figure 2: The workflow of the QA Dataset Generation Framework integrates human expertise across its stages. Starting with the Article Selection Module, where experts select relevant articles, the process moves to Question Generation, with human-crafted QA pairs. The Filtering Module then involves expert review to discard low-quality pairs, followed by the Classification Module where pairs are categorized into 'Concise Answer' and 'Elaborate Answer' types based on human assessment. In the Open Domain Module, these curated pairs are compiled into the final QA dataset, which later undergoes a Human Evaluation phase for assessing answerability, relevance, clarity, and fluency.
  • Figure 3: Elaborate/Concise answer examples
  • Figure 4: Distribution of entity types in the dataset
  • Figure 5: Bi-gram Frequencies from ArabicaQA Questions
  • ...and 2 more figures