BanglaQuAD: A Bengali Open-domain Question Answering Dataset
Md Rashad Al Hasan Rony, Sudipto Kumar Shaha, Rakib Al Hasan, Sumon Kanti Dey, Amzad Hossain Rafi, Ashraf Hasan Sirajee, Jens Lehmann
TL;DR
The paper introduces BanglaQuAD, a large open-domain Bengali QA dataset with 30,808 high-quality, human-annotated question-answer pairs drawn from 658 Bengali Wikipedia articles. It details a complete dataset-generation workflow: data collection, annotation via the Bengali-centric BnAnno tool, and conversion to the SQuAD format, with unanswerable questions and variable-length answers included to challenge QA systems. The authors provide extensive dataset statistics, baseline evaluations using Bengali-focused models (BanglaBERT and IndicBERT) with comparisons against the UDDIPOK and TyDi QA datasets, and qualitative analyses that illustrate question types and annotation quality (kappa = 0.79). The work advances Bengali NLP by offering a high-quality benchmark, a practical annotation tool, and a public data release to spur development of robust open-domain Bengali QA and information retrieval systems, with future plans for a leaderboard and expansion.
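To make the SQuAD-format conversion step concrete, the following is a minimal sketch of how one answerable and one unanswerable question could be serialized. It assumes the widely used SQuAD v2.0 JSON layout (`data`, `paragraphs`, `qas`, `answers`, `answer_start`, `is_impossible`); whether the BanglaQuAD release uses exactly these keys is an assumption. The helper names `make_qa` and `make_dataset` are hypothetical, and the Bengali sentences are invented for illustration rather than taken from the dataset.

```python
import json


def make_qa(qid, question, context, answer_text=None):
    """Build one SQuAD v2.0-style QA entry (hypothetical helper).

    Passing answer_text=None marks the question as unanswerable.
    """
    if answer_text is None:
        return {"id": qid, "question": question,
                "answers": [], "is_impossible": True}
    # answer_start is a character (code point) offset into the context.
    start = context.find(answer_text)
    if start < 0:
        raise ValueError("answer span not found in context")
    return {"id": qid, "question": question,
            "answers": [{"text": answer_text, "answer_start": start}],
            "is_impossible": False}


def make_dataset(title, context, qas):
    """Wrap a single paragraph and its questions in the top-level layout."""
    return {"version": "v2.0",
            "data": [{"title": title,
                      "paragraphs": [{"context": context, "qas": qas}]}]}


# Illustrative passage: "The capital of Bangladesh is Dhaka."
context = "বাংলাদেশের রাজধানী ঢাকা।"
qas = [
    # Answerable: "What is the capital of Bangladesh?"
    make_qa("q1", "বাংলাদেশের রাজধানী কী?", context, "ঢাকা"),
    # Unanswerable from this passage: "What is the population of Dhaka?"
    make_qa("q2", "ঢাকার জনসংখ্যা কত?", context),
]
dataset = make_dataset("বাংলাদেশ", context, qas)

# ensure_ascii=False keeps the Bengali text readable in the output.
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Storing the answer as a text span plus a start offset lets an extractive model be scored on span boundaries, while the `is_impossible` flag separates questions the passage cannot answer.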
Abstract
Bengali is the seventh most spoken language on earth, yet it is considered a low-resource language in the field of natural language processing (NLP). Question answering over unstructured text is a challenging NLP task, as it requires understanding both the question and the passage. Very few researchers have attempted question answering over Bengali (natively pronounced as Bangla) text. Existing approaches typically construct datasets by directly translating them from English to Bengali, which produces noisy and improper sentence structures. Furthermore, these datasets lack topics and terminology related to the Bengali language and people. This paper introduces BanglaQuAD, a Bengali question answering dataset containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers. Additionally, we propose an annotation tool that facilitates question-answering dataset construction on a local machine. A qualitative analysis demonstrates the quality of our proposed dataset.
