Cross-Language Approach for Quranic QA

Islam Oshallah, Mohamed Basem, Ali Hamdi, Ammar Mohammed

TL;DR

The paper tackles the Quranic QA challenge posed by limited resources and the linguistic gap between Modern Standard Arabic questions and Classical Arabic verses. It proposes a cross-language pipeline that augments the dataset through translation and paraphrasing, and fine-tunes a range of transformer models (including RoBERTa-Base and DeBERTa-v3-Base) using SQuAD v2 as a training signal, complemented by cross-encoder architectures. Empirical results show that cross-language data augmentation and fine-tuning significantly improve retrieval metrics, with RoBERTa-Base achieving MAP@10 of 0.34 and MRR of 0.52, and DeBERTa-v3-Base reaching Recall@10 of 0.50 and Precision@10 of 0.24. These findings demonstrate the viability of cross-language strategies to enhance Quranic QA systems and provide a foundation for broader multilingual applications in religious text understanding.

Abstract

Question answering systems face critical limitations in languages with limited resources and scarce data, making the development of robust models especially challenging. The Quranic QA system holds significant importance as it facilitates a deeper understanding of the Quran, a holy text for over a billion people worldwide. However, these systems face unique challenges, including the linguistic disparity between questions written in Modern Standard Arabic and answers found in Quranic verses written in Classical Arabic, and the small size of existing datasets, which further restricts model performance. To address these challenges, we adopt a cross-language approach by (1) Dataset Augmentation: expanding and enriching the dataset through machine translation to convert Arabic questions into English, paraphrasing questions to create linguistic diversity, and retrieving answers from an English translation of the Quran to align with multilingual training requirements; and (2) Language Model Fine-Tuning: utilizing pre-trained models such as BERT-Medium, RoBERTa-Base, DeBERTa-v3-Base, ELECTRA-Large, Flan-T5, Bloom, and Falcon to address the specific requirements of Quranic QA. Experimental results demonstrate that this cross-language approach significantly improves model performance, with RoBERTa-Base achieving the highest MAP@10 (0.34) and MRR (0.52), while DeBERTa-v3-Base excels in Recall@10 (0.50) and Precision@10 (0.24). These findings underscore the effectiveness of cross-language strategies in overcoming linguistic barriers and advancing Quranic QA systems.
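The abstract reports four ranking metrics (MAP@10, MRR, Recall@10, Precision@10). As a minimal sketch of how such per-query scores are commonly computed, and not the authors' evaluation code, the functions below take a ranked list of passage IDs and a set of relevant IDs. The function names and the passage-ID format are illustrative assumptions. The AP@k normalization by `min(len(relevant), k)` is one common convention and may differ from the one used in the paper.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k slots filled by relevant passages."""
    return sum(1 for pid in ranked[:k] if pid in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of all relevant passages that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for pid in ranked[:k] if pid in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0 if none is retrieved."""
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Mean of precision values at each relevant hit within the top k.

    Assumption: normalized by min(len(relevant), k); other conventions exist.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k)


# Hypothetical ranking for one question; IDs are made-up "chapter:verse" labels.
ranked = ["2:255", "1:1", "3:18", "112:1"]
relevant = {"1:1", "112:1"}

p10 = precision_at_k(ranked, relevant)
r10 = recall_at_k(ranked, relevant)
rr = reciprocal_rank(ranked, relevant)
ap10 = average_precision_at_k(ranked, relevant)
```

MAP@10 and MRR are then the means of `average_precision_at_k` and `reciprocal_rank` over all questions in the evaluation set.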

Paper Structure

This paper contains 12 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Workflow for Cross-Language Dataset Expansion and Fine-Tuning.
  • Figure 2: Cross-Language Example of Quranic QA: A question posed in Arabic is translated into English using the Google Translate API, then used to retrieve relevant passages from the English-translated Quran, and then translated back into Arabic.
  • Figure 3: Workflow for LLM & LM training.
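
The flow described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation. The `translate` argument is a hypothetical callable standing in for whatever translation service is used, the token-overlap scorer is a deliberately simple placeholder for the fine-tuned retrieval models, and the last step is read here as translating the retrieved passages back into Arabic.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (text, source_lang, target_lang) -> translated text.
Translate = Callable[[str, str, str], str]


def token_overlap(query: str, passage: str) -> float:
    """Toy relevance score: share of query tokens found in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def cross_language_retrieve(
    question_ar: str,
    english_passages: Dict[str, str],
    translate: Translate,
    top_k: int = 10,
) -> List[Tuple[str, str]]:
    """Translate the question to English, rank English passages, translate hits back."""
    # Step 1: Arabic question -> English.
    question_en = translate(question_ar, "ar", "en")
    # Step 2: rank English passages (placeholder scorer, not a trained model).
    ranked = sorted(
        english_passages.items(),
        key=lambda item: token_overlap(question_en, item[1]),
        reverse=True,
    )
    # Step 3: translate the top-k passages back into Arabic.
    return [(pid, translate(text, "en", "ar")) for pid, text in ranked[:top_k]]


# Stub translator backed by a lookup table so the sketch runs offline.
_lookup = {
    "Q_AR": "who created the heavens",
    "He created the heavens and the earth": "P1_AR",
    "Say he is one": "P2_AR",
}


def stub_translate(text: str, source_lang: str, target_lang: str) -> str:
    return _lookup[text]


passages = {
    "p1": "He created the heavens and the earth",
    "p2": "Say he is one",
}
top = cross_language_retrieve("Q_AR", passages, stub_translate, top_k=1)
```

Passing the translator in as a parameter keeps the sketch independent of any particular translation API and makes it easy to test with a stub.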