Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani

TL;DR

Masader addresses the lack of a centralized Arabic NLP data catalogue by building the largest public dataset repository with 200 resources annotated across 25 attributes. It implements a five-step methodology—resource discovery, filtering, metadata annotation, verification, and analysis—underpinned by a five-part metadata taxonomy and a practical Google Sheets-based workflow, plus a web interface for discovery. The paper analyzes the Arabic NLP data landscape, highlighting growth in publications, accessibility trends, licensing gaps, dialect coverage, and common tasks, and proposes concrete recommendations to improve data availability, documentation, and governance. Overall, Masader provides a scalable framework for metadata annotation and dataset discovery that can be extended to other languages and informs best practices for open data in NLP.

Abstract

The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most published datasets lack metadata annotations that describe their attributes. Moreover, there is no public catalogue that indexes all the publicly available datasets related to specific regions or languages. This issue becomes more prominent for low-resource dialectal languages. In this paper, we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks and highlight some issues about the current status of Arabic NLP datasets and suggest recommendations to address them.

Paper Structure

This paper contains 36 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Metadata schema for Arabic NLP resources.
  • Figure 2: An example of metadata annotation on the Shami dataset (abu-kwaik-etal-2018-shami). The subsets tag represents the dialects, and each subset (for example, Jordanian) inherits all the metadata from the superset Shami, except the volume.
  • Figure 3: The count of publications across conferences, journals, preprints and workshops.
  • Figure 4: Dialects representation across datasets.
  • Figure 5: Histogram of tasks. We only show the tasks that appeared more than once in papers.
  • ...and 7 more figures
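
The subset inheritance rule described in the Figure 2 caption (each subset takes all metadata from its superset except the volume) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the field names (`name`, `task`, `license`, `volume`, `subsets`), the helper `expand_subsets`, the second subset, and all values are hypothetical placeholders rather than entries from the catalogue.

```python
from typing import Any


def expand_subsets(entry: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one record per subset.

    Each record copies every superset attribute except `volume` and
    `subsets`, then takes its own name and volume from the subset.
    """
    # Attributes shared by all subsets: everything but volume and the subset list.
    shared = {k: v for k, v in entry.items() if k not in ("volume", "subsets")}
    records = []
    for subset in entry.get("subsets", []):
        record = dict(shared)                 # inherit superset metadata
        record["name"] = subset["name"]       # subset-specific name (a dialect)
        record["volume"] = subset["volume"]   # volume is NOT inherited
        record["superset"] = entry["name"]    # keep a link back to the parent
        records.append(record)
    return records


# Hypothetical entry; field names and numbers are placeholders, not real metadata.
shami = {
    "name": "Shami",
    "task": "dialect identification",  # assumed value, for illustration only
    "license": "unknown",              # assumed value, for illustration only
    "volume": 100,                     # placeholder total
    "subsets": [
        {"name": "Jordanian", "volume": 30},        # placeholder volume
        {"name": "Other dialect", "volume": 70},    # hypothetical subset
    ],
}

for record in expand_subsets(shami):
    print(record["name"], record["volume"], record["license"])
```

Storing only the per-subset volume and resolving the rest from the superset avoids repeating shared attributes for every dialect, which keeps a spreadsheet-based annotation workflow easier to maintain.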