UQA: Corpus for Urdu Question Answering

Samee Arif, Sualeha Farid, Awais Athar, Agha Ali Raza

TL;DR

Urdu is a high-demand but under-resourced language for QA. The authors translate SQuAD2.0 into Urdu using the Enclose to Anchor, Translate, Seek (EATS) method and select Seamless M4T as the translation engine based on human evaluation, producing the UQA dataset. UQA enables benchmarking of multilingual QA models, with XLM-RoBERTa-XL achieving the top performance (F1 85.99, EM 74.56), surpassing several Urdu baselines and related multilingual datasets. The resource and methodology lay the groundwork for cross-lingual transfer, open Urdu NLP development, and future domain-specific QA datasets built with generation-based approaches rather than translation.
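
The F1 and EM figures above are the usual extractive-QA metrics. The sketch below shows how they are typically computed for one prediction, assuming plain whitespace tokenization. The function names and the normalization step are illustrative assumptions; the paper's exact evaluation script and Urdu-specific normalization are not described in this section.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    # Assumption: whitespace tokenization only. A real evaluation script
    # may also strip punctuation or apply Urdu-specific normalization.
    return text.strip().split()


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # SQuAD2.0-style handling of unanswerable questions: an empty gold
    # answer is only matched by an empty prediction.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Reported scores such as F1 85.99 would then be the mean of the per-question values over the evaluation set, expressed as a percentage.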

Abstract

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. XLM-RoBERTa-XL achieves an F1 score of 85.99 and an exact match (EM) score of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.
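
To make the three EATS steps named above concrete, here is a minimal sketch of one plausible reading: enclose the answer span in marker strings, translate the marked context, then seek the markers in the output to recover the translated answer and its offset. The marker strings, function names, and the `translate` callable are hypothetical placeholders, not the authors' implementation; the actual delimiters and translation calls are not given in this section.

```python
from typing import Callable, Optional, Tuple

# Hypothetical anchor markers; the delimiters used in the paper may differ.
OPEN, CLOSE = "[[", "]]"


def enclose(context: str, answer_start: int, answer_text: str) -> str:
    """Step 1: wrap the answer span in anchor markers before translation."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def seek(translated: str) -> Optional[Tuple[str, str, int]]:
    """Step 3: locate the anchors, strip them, and return
    (clean_context, answer_text, answer_start), or None if they were lost."""
    start = translated.find(OPEN)
    if start == -1:
        return None
    end = translated.find(CLOSE, start + len(OPEN))
    if end == -1:
        return None
    answer = translated[start + len(OPEN):end]
    clean = translated[:start] + answer + translated[end + len(CLOSE):]
    return clean, answer, start


def eats(
    context: str,
    answer_start: int,
    answer_text: str,
    translate: Callable[[str], str],
) -> Optional[Tuple[str, str, int]]:
    """Run enclose, translate (step 2), and seek for one QA example.
    `translate` stands in for any machine translation call."""
    marked = enclose(context, answer_start, answer_text)
    return seek(translate(marked))
```

In this sketch, an example whose markers do not survive translation returns None, so a caller could drop or re-process it. How the paper handles such cases is not stated here.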

Paper Structure

This paper contains 13 sections, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Question types example from SQuAD2.0
  • Figure 2: Three-step solution