ArabicaQA: A Comprehensive Dataset for Arabic Question Answering
Abdelrahman Abdallah, Mahmoud Kasem, Mahmoud Abdalla, Mohamed Mahmoud, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt
TL;DR
ArabicaQA tackles the scarcity of Arabic QA resources by introducing the first large-scale Arabic machine reading comprehension (MRC) and open-domain QA dataset, complemented by AraDPR, a dense passage retriever trained on Arabic Wikipedia. The work benchmarks a diverse set of LLMs on Arabic QA and evaluates retrieval-augmented generation to assess how retrieved evidence improves responses. Key contributions include a heterogeneous QA dataset with Elaborate and Concise answer types, a dedicated Arabic dense retriever, and a comprehensive evaluation of monolingual and multilingual models on Arabic QA. This resource is poised to advance Arabic NLP by improving QA accuracy, information retrieval, and the development of language technologies tailored to Arabic.
Abstract
In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable questions and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels for open-domain questions, marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research at https://github.com/DataScienceUIBK/ArabicaQA.
