Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

Mohammed Alkhowaiter, Norah Alshahrani, Saied Alshahrani, Reem I. Masoud, Alaa Alzahrani, Deema Alnuhait, Emad A. Alghamdi, Khalid Almubarak

TL;DR

This paper tackles the scarcity and uneven quality of Arabic post-training datasets used to align LLMs with human intent. It presents a HF Hub–driven methodology to collect metadata, map tasks to capabilities, and evaluate datasets across six criteria, enabling a transparent, reproducible landscape of Arabic post-training resources. The study reveals major gaps in task diversity, documentation, adoption, and cultural/safety alignment, and proposes concrete, actionable guidelines to address them, including dialectal data, native content, and hybrid annotation approaches. By releasing open-source demo tools, it aims to accelerate the development and evaluation of culturally aware, robust Arabic LLMs with better downstream impact.

Abstract

Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., Persona and System Prompts); (3) Alignment (e.g., Cultural, Safety, Ethics, and Fairness); and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic-centric LLMs and applications while providing concrete recommendations for future efforts in Arabic post-training dataset development.
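
The six evaluation criteria listed above can be pictured as per-dataset checks over collected metadata. The following is a minimal sketch, not the authors' procedure: the `DatasetRecord` fields, the `evaluate` function, and every threshold are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass
class DatasetRecord:
    """Hypothetical metadata for one dataset; field names are assumptions."""
    dataset_id: str
    likes: int
    downloads: int
    models_trained: int       # models reported as trained on the dataset
    last_modified: date
    has_dataset_card: bool
    license: Optional[str]
    has_paper: bool


def evaluate(record: DatasetRecord, today: date,
             min_likes: int = 10, min_downloads: int = 1000,
             max_age_days: int = 365) -> Dict[str, bool]:
    """Return one pass/fail flag per criterion (illustrative thresholds)."""
    age_days = (today - record.last_modified).days
    return {
        "popularity": record.likes >= min_likes
                      or record.downloads >= min_downloads,
        "adoption": record.models_trained > 0,
        "recency": age_days <= max_age_days,
        "documentation": record.has_dataset_card,
        "licensing": bool(record.license),
        "scientific_contribution": record.has_paper,
    }


# Made-up record, used only to show the shape of the output.
example = DatasetRecord(
    dataset_id="example-org/example-arabic-sft",
    likes=3, downloads=250, models_trained=0,
    last_modified=date(2023, 1, 15),
    has_dataset_card=True, license=None, has_paper=False,
)
flags = evaluate(example, today=date(2025, 1, 15))
```

In this made-up case only the documentation flag passes, which mirrors the kind of gap the review reports: a dataset can be documented yet unlicensed, unadopted, and unmaintained.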

Paper Structure

This paper contains 22 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: General Processing Pipeline for Arabic Post-Training Dataset Collection, Filtering, and Evaluation.
  • Figure 2: Distribution of datasets across tasks. Labels include the percentage of datasets in each task. Tasks with no datasets are shown for the sake of completeness.
  • Figure 3: Overview of dataset quality across tasks. The subfigures present six quality indicators: documentation, popularity, adoption, recency, licensing transparency, and scientific contribution. While the full taxonomy includes 12 tasks, we report results for the 9 tasks with available datasets. Persona & System Prompts, Function Call, Code Generation, and Official Documentation are excluded because no datasets were available for those tasks.
  • Figure 4: Range of dataset sizes per task (log scale). Each horizontal bar spans the minimum and maximum number of rows for datasets within a task, with red, blue, and black points denoting the minimum, maximum, and mean sizes, respectively. Although the taxonomy includes 12 tasks, only the tasks with available datasets are shown (n = 9). Dataset sizes vary dramatically not only across tasks but also within the same task category: some tasks, such as Summarization and Translation, contain datasets ranging from a few dozen rows to over 10 billion. This high variance makes aggregate measures like the mean misleading; therefore, we emphasize range-based visualizations over summary statistics when discussing dataset scale.
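
The point in the Figure 4 caption about misleading means can be checked with a few lines of arithmetic. The sketch below uses made-up row counts, not values from the paper, and a hypothetical helper `size_summary`; it assumes every dataset has at least one row so that the log-scale span is defined.

```python
import math
from statistics import mean
from typing import Dict, List


def size_summary(row_counts: List[int]) -> Dict[str, float]:
    """Summarize dataset sizes for one task: range, mean, and log10 span."""
    smallest, largest = min(row_counts), max(row_counts)
    return {
        "min": smallest,
        "max": largest,
        "mean": mean(row_counts),
        # Orders of magnitude between the smallest and largest dataset.
        "log10_span": math.log10(largest / smallest),
    }


# Illustrative sizes only: three small datasets and one very large one.
sizes = [50, 2_000, 30_000, 10_000_000_000]
summary = size_summary(sizes)

# Count how many datasets fall below the mean.
below_mean = sum(1 for s in sizes if s < summary["mean"])
```

With these numbers the mean is about 2.5 billion rows, yet three of the four datasets sit far below it, while the range spans more than eight orders of magnitude. That is why a min-to-max bar on a log axis describes such a task more faithfully than a single average.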