MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering

Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Sharvi Endait, Raviraj Joshi

TL;DR

This work addresses the lack of QA datasets for Marathi by translating SQuAD v2.0 into Marathi using a robust span-mapping method that aligns translated answers within translated passages. It releases MahaSQuAD, a large-scale Marathi QA corpus with train/validation/test splits and a gold 500-sample set, along with language-specific models MahaBERT and MahaRoBERTa that achieve strong performance. The authors present a detailed dataset-creation pipeline involving sentence-level translation, similarity-based span extraction, and transliteration, enabling scalable cross-lingual QA data generation for low-resource languages. Empirical results show that Marathi-specialized models outperform multilingual baselines, indicating the practical impact of language-tuned QA resources for information access in Marathi-speaking communities.

Abstract

Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their widespread accessibility. This research addresses the absence of efficient QnA datasets in low-resource languages by translating the English Question Answering Dataset (SQuAD) using a robust data curation approach. We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples. We also present a gold test set of 500 manually verified examples. Challenges in maintaining context and handling linguistic nuances are addressed to keep the translations accurate. Moreover, a QnA dataset cannot be converted into a low-resource language by translation alone: the translated answer must also be mapped to its span in the translated passage. To address this challenge, we present a generic approach for translating SQuAD into any low-resource language. We thus offer a scalable way to bridge the linguistic and cultural gaps that low-resource languages face in question-answering systems. The datasets and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP .
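
The span-mapping problem mentioned in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a character-level similarity from Python's standard `difflib` as a stand-in for whatever similarity measure the paper uses, and the function name `find_answer_span` and the `slack` parameter are hypothetical. The idea is to slide word windows over the translated passage and keep the window most similar to the independently translated answer.

```python
from difflib import SequenceMatcher


def find_answer_span(context: str, translated_answer: str, slack: int = 2):
    """Return (start_char, span_text, score) for the window of `context`
    that is most similar to `translated_answer`.

    Assumption: similarity is difflib's character-level ratio, used here
    only as a placeholder for the paper's similarity measure.
    """
    tokens = context.split()

    # Record the character offset of every whitespace-separated token.
    offsets = []
    pos = 0
    for tok in tokens:
        pos = context.index(tok, pos)
        offsets.append(pos)
        pos += len(tok)

    n_answer = max(1, len(translated_answer.split()))
    best_score, best_start, best_text = -1.0, -1, ""

    # Try windows slightly shorter and longer than the answer, because the
    # translated answer may differ in length from its form in the passage.
    for size in range(max(1, n_answer - slack), n_answer + slack + 1):
        for i in range(len(tokens) - size + 1):
            start = offsets[i]
            end = offsets[i + size - 1] + len(tokens[i + size - 1])
            candidate = context[start:end]
            score = SequenceMatcher(None, candidate, translated_answer).ratio()
            if score > best_score:
                best_score, best_start, best_text = score, start, candidate

    return best_start, best_text, best_score


if __name__ == "__main__":
    passage = "Pune is a city in the state of Maharashtra in India"
    start, text, score = find_answer_span(passage, "Maharashtra")
    print(start, text, round(score, 3))
```

The returned character offset plays the role of SQuAD's `answer_start` field. A real pipeline would likely restrict the search to the sentence containing the answer and use a similarity measure suited to Marathi morphology; both choices are left open here.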

Paper Structure (15 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: The sentence containing the answer, the translated answer, and the answer text from the passage that has the highest similarity with the translated answer.
  • Figure 2: Algorithm for obtaining the answer and the answer span from the context
  • Figure 3: Left: An English SQuAD example, Right: Corresponding entry from MahaSQuAD