Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi

Maithili Sabane, Onkar Litake, Aman Chadha

TL;DR

This work addresses the scarcity of QA data for Hindi and Marathi by translating the SQuAD 2.0 dataset into both languages, creating 28,000 samples per language and aligning translated answers via a sliding-window similarity method. The authors evaluate a suite of Transformer-based models (including language-specific versions) and identify HindiBERT as the top performer for Hindi QA and MahaBERT for Marathi QA, underscoring the advantage of language-specific pretraining. They demonstrate that large, high-quality QA datasets in these languages can be produced with translation-based pipelines and similarity-based alignment, and they release the dataset, models, and code to spur further research and potential extension to other Indic languages. Overall, the dataset and findings advance QA research for Hindi and Marathi, enabling more accurate, accessible AI tools for these sizable language communities.
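The sliding-window alignment mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say which similarity measure is used, so the sketch assumes a character-level ratio from Python's standard `difflib`, and the function name `best_window` and its `slack` parameter are hypothetical. An English example is used for readability.

```python
from difflib import SequenceMatcher


def best_window(context_tokens, answer_tokens, slack=1):
    """Return (start, end, score) for the context span most similar to the answer.

    Assumption: similarity is difflib's character-level ratio; the paper may
    use a different measure. Window sizes range over len(answer) +/- slack.
    """
    target = " ".join(answer_tokens)
    n = len(answer_tokens)
    best = (0, 0, -1.0)
    for size in range(max(1, n - slack), n + slack + 1):
        # Slide a window of `size` tokens across the context.
        for start in range(0, len(context_tokens) - size + 1):
            window = " ".join(context_tokens[start:start + size])
            score = SequenceMatcher(None, window, target).ratio()
            if score > best[2]:  # keep the first, strictly best window
                best = (start, start + size, score)
    return best


context = "The Normans were the people who gave their name to Normandy in France".split()
answer = "Normandy in France".split()

start, end, score = best_window(context, answer)
span = " ".join(context[start:end])
print(span)  # -> Normandy in France
```

In a translation pipeline the answer and the context are translated separately, so the translated answer may not appear verbatim in the translated context. Scoring every window and keeping the best one gives an approximate answer span (and hence a start index) even when the two translations differ slightly.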

Abstract

Recent advances in deep learning have led to the development of highly sophisticated systems with an unquenchable appetite for data. On the other hand, building good deep learning models for low-resource languages remains a challenging task. This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi. Despite Hindi being the 3rd most spoken language worldwide, with 345 million speakers, and Marathi being the 11th most spoken language globally, with 83.2 million speakers, both languages face limited resources for building efficient Question Answering systems. To tackle the challenge of data scarcity, we have developed a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi. We release the largest Question Answering dataset available for these languages, with each dataset containing 28,000 samples. We evaluate the dataset on various architectures and release the best-performing models for both Hindi and Marathi, which will facilitate further research in these languages. Because it relies on similarity tools, our method has the potential to create datasets in diverse languages, thereby enhancing the understanding of natural language across varied linguistic contexts. Our fine-tuned models, code, and dataset will be made publicly available.

Paper Structure

This paper contains 12 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Sliding window technique for answer extraction.
  • Figure 2: Illustration of the method using an English example.