Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers

Raghad Salameh, Mohamad Al Mdfaa, Nursultan Askarbekuly, Manuel Mazzara

TL;DR

This work tackles the scarcity of labeled Quranic recitation data for non-Arabic-speaking learners by crowdsourcing audio and annotations through NamazApp and a dedicated Quran Voice platform. It details an altruistic (volunteer-based) crowdsourcing workflow, an audio standardization pipeline, and an annotation protocol with annotator training and quality controls. The authors report about 7,000 recitations from 1,287 participants across more than 11 non-Arabic-speaking countries, of which 1,166 recitations are labeled in six categories, with a crowd MCC of 0.68, a Krippendorff's alpha of 0.63, and strong algorithm-expert agreement (~0.89–0.91). The study demonstrates that high-quality Quranic audio data can be collected at scale and outlines strategies for improving annotation reliability and for future Tajweed-focused labeling, with the aim of enabling AI-assisted learning tools for non-Arabic-speaking Muslims.

Abstract

This paper addresses the challenge of learning to recite the Quran for non-Arabic speakers. We explore the possibility of crowdsourcing a carefully annotated Quranic dataset, on top of which AI models can be built to simplify the learning process. In particular, we use the volunteer-based crowdsourcing genre and implement a crowdsourcing API to gather audio assets. We integrated the API into an existing mobile application called NamazApp to collect audio recitations. We developed a crowdsourcing platform called Quran Voice for annotating the gathered audio assets. As a result, we have collected around 7000 Quranic recitations from a pool of 1287 participants across more than 11 non-Arabic countries, and we have annotated 1166 recitations from the dataset in six categories. We have achieved a crowd accuracy of 0.77, an inter-rater agreement of 0.63 between the annotators, and 0.89 between the labels assigned by the algorithm and the expert judgments.

Paper Structure (18 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Quran player Architecture for the first crowdsourcing task
  • Figure 2: Audio standardization pipeline
  • Figure 3: Quran Voice Architecture for the second crowdsourcing task
  • Figure 4: Training session for the Validate Verse Correctness task on Quran Voice
  • Figure 5: Annotation Aggregation for Validate Verse Correctness task
  • ...and 2 more figures