Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain

Huda AlShuhayeb, Behrouz Minaei-Bidgoli, Mohammad E. Shenassa, Sayyed-Ali Hossayni

TL;DR

This work introduces Noor-Ghateh, a large, expert-annotated Arabic word segmentation dataset derived from the Hadith text Shariat al-Islam, comprising approximately 223,690 words, for benchmarking morphological analysis tools in the Hadith domain. It situates Noor-Ghateh among existing Arabic corpora and details its XML-encoded, root-prefix-suffix annotation scheme and the five tag groups used to capture morpheme structure. The authors benchmark three segmentation tools (Farasa, CAMeL, and ALP) against Noor-Ghateh and two additional datasets (NAFIS and the Quranic corpus), reporting varying accuracy across corpora and highlighting CAMeL and ALP as strong performers in several settings. The paper positions Noor-Ghateh as a primary Hadith segmentation benchmark and outlines future directions, including Seq-to-Seq methods with attention, to further improve Arabic lexical segmentation. Overall, Noor-Ghateh provides a high-fidelity, domain-specific resource with practical implications for evaluating and advancing Arabic NLP in religious-text contexts.
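To make the benchmarking idea concrete, here is a minimal sketch of scoring a segmenter against gold annotations. It assumes a word-level exact-match accuracy, which may differ from the metric used in the paper, and the function name `segmentation_accuracy` and the transliterated toy words are hypothetical illustrations, not taken from Noor-Ghateh or from any of the tools named above.

```python
from typing import List, Sequence


def segmentation_accuracy(
    gold: Sequence[List[str]], predicted: Sequence[List[str]]
) -> float:
    """Fraction of words whose predicted morpheme list exactly matches gold.

    Assumption: word-level exact match; the paper's metric may differ.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if list(g) == list(p))
    return correct / len(gold)


# Toy, hand-made data in Latin transliteration (hypothetical, not from the dataset).
# Each word is a list of morphemes: prefixes, stem/root, suffixes.
gold = [["w", "ktb", "hA"], ["Al", "ktAb"], ["qAl"]]
pred = [["w", "ktb", "hA"], ["AlktAb"], ["qAl"]]

accuracy = segmentation_accuracy(gold, pred)
print(f"word-level accuracy: {accuracy:.3f}")
```

Exact match is a strict criterion: a word counts as correct only if every morpheme boundary agrees with the gold annotation, so the second toy word, which misses a prefix split, is scored as wrong.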

Abstract

The Arabic language has numerous complex and rich morphological features, which are highly useful when analyzing traditional Arabic texts, especially in literary and religious contexts, and which help in understanding the meaning of those texts. Word segmentation means splitting a word into its components, such as the root and affixes. In morphological datasets, the variety of tags and the number of data samples help in evaluating morphological techniques. In this paper, we present a standard dataset for evaluating Arabic segmentation tools, which includes approximately 223,690 words from the book "Shariat al-Islam", labeled by human experts. To the best of our knowledge, this dataset is superior to the other Arabic Hadith datasets in terms of volume and word variety. To evaluate the dataset, we applied several tools, including Farasa, CAMeL, and ALP, reported the annotation quality, and analyzed the benchmark specifications. This be

Paper Structure (16 sections, 2 figures, 4 tables)