Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Ali Mekky; Mohamed El Zeftawy; Lara Hassan; Amr Keleg; Preslav Nakov

TL;DR

The paper addresses Multi-Label Arabic Dialect Identification (MLADI) and shows that reusing single-label ADI data for multi-label training is hampered by negative sampling: many sentences treated as negatives are in fact acceptable in several dialects. It introduces a pseudo-labeled MLADI dataset built with GPT-4o and 18 binary dialect acceptability classifiers, aggregated under the guidance of the Arabic Level of Dialectness (ALDi), and trains LahjatBERT with cardinality- and ALDi-based curriculum learning. The best LahjatBERT variant achieves a macro-F1 of about $0.69$ on the MLADI leaderboard, compared to $0.55$ for the strongest previously reported system. The approach offers a scalable path toward multi-dialect identification for Arabic and may apply to other multi-label linguistic tasks.
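For readers unfamiliar with the headline metric, the following is a minimal sketch of macro-F1 in a multi-label setting: F1 is computed separately for each dialect label and the per-label scores are averaged without weighting. The function name `macro_f1`, the multi-hot input format, and the convention of scoring a label with no positives and no predictions as 0 are assumptions made for illustration; the exact evaluation script used by the MLADI leaderboard is not described here.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 scores for multi-hot label vectors."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Assumption: a label with no positives and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy example with two hypothetical dialect labels and three sentences.
gold = [[1, 0], [1, 1], [0, 1]]
pred = [[1, 0], [0, 1], [0, 1]]
score = macro_f1(gold, pred)
```

Because every label contributes equally, a dialect with few samples influences the score as much as a frequent one.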

Abstract

Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, but recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address this issue, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). We then train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LahjatBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.
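The idea of a curriculum aligned with label cardinality can be illustrated with a short sketch. It groups pseudo-labeled samples into cumulative training stages, where each stage admits samples up to a given number of active dialect labels. The names `Sample`, `cardinality`, and `curriculum_stages`, the stage caps, and the choice to introduce low-cardinality samples first are all assumptions made for illustration; the abstract does not specify the actual schedule, and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample format: (sentence, multi-hot vector of dialect labels).
Sample = Tuple[str, Sequence[int]]


def cardinality(labels: Sequence[int]) -> int:
    """Number of dialects in which the sentence is labeled as acceptable."""
    return sum(labels)


def curriculum_stages(samples: List[Sample], caps: Sequence[int]) -> List[List[Sample]]:
    """Build cumulative stages: stage k keeps samples with cardinality <= caps[k].

    Assumption: lower cardinality is treated as easier and introduced first.
    """
    return [[s for s in samples if cardinality(s[1]) <= cap] for cap in caps]


# Toy data with three hypothetical dialect labels.
data: List[Sample] = [
    ("sentence a", [1, 0, 0]),
    ("sentence b", [1, 1, 0]),
    ("sentence c", [1, 1, 1]),
]
stages = curriculum_stages(data, caps=[1, 2, 3])
stage_sizes = [len(stage) for stage in stages]
```

A training loop would then run one or more epochs per stage in order, so that the final stage covers the full dataset. An ALDi-based curriculum could follow the same pattern with ALDi score ranges in place of cardinality caps.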

Paper Structure (41 sections, 2 equations, 9 figures, 7 tables)

Introduction
MLADI Task's Setup and Previous Attempts
Previous MLADI Attempts.
Difficulties of Using Existing Datasets for Dialect Acceptability Classification
Other Countries’ Samples Are Not Always Negative Samples
Definition
Training Dynamics and Multi-label Samples
Intuition
Methodology
Findings
Usability of Training Dynamics in Flagging Wrongly Assigned Samples
Results
Moving Forward
Multi-Label ADI Dataset Creation
(1) Binary Dialect Classifiers.
...and 26 more sections

Figures (9)

Figure 1: Number of samples in each dialect after combining the NADI 2020, 2021, 2023 datasets. Samples with automatically estimated Arabic Level of Dialectness (ALDi; keleg-etal-2023-ALDi) $\leq 0.11$ are expected to be in MSA. The majority of the MSA samples are expected to be acceptable in all dialects.
Figure 2: The training dynamics for 6 binary acceptability classifiers, characterized by the mean confidence in the label across different steps/stages of the model training (y-axis), and the standard deviation of these confidence values (x-axis). Each pair shows the training dynamics' metrics for the non-MSA positive (left) and negative (right) samples of a single classifier, with the respective number of samples shown above each subplot. Note: sample correctness is binned into the ranges 0, ]0, 0.2[, [0.2, 0.4[, [0.4, 0.6[, [0.6, 0.8[, [0.8, 1[, and 1.
Figure 3: Number of samples for each label cardinality according to the three pseudo-labeling methods.
Figure 4: Illustration of the cardinality-based curriculum schedule, showing the progressive introduction of higher difficulty cardinality samples. The numerical values are illustrative and do not reflect the actual dataset.
Figure B1: Distribution of ALDi scores in the dataset. Vertical dashed lines indicate the thresholds distinguishing MSA text ($a_i < 0.11$), low-to-medium dialectness ($0.11$ to $0.44$), medium dialectness ($0.44$ to $0.77$), and highly dialectal text ($a_i > 0.77$).
...and 4 more figures
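The training-dynamics quantities named in the Figure 2 caption (mean confidence in the label, standard deviation of that confidence, and correctness) can be computed per sample from the probabilities a classifier assigns to the sample's given label at successive checkpoints. The sketch below is illustrative only: the function name `training_dynamics`, the use of the population standard deviation, and the 0.5 decision threshold for a binary classifier are assumptions, not details confirmed by the captions.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def training_dynamics(label_probs: Sequence[float], threshold: float = 0.5) -> Tuple[float, float, float]:
    """Summarize one sample's training dynamics.

    label_probs holds the probability the model gave to the sample's assigned
    label at each recorded checkpoint (a hypothetical logging format).
    Returns (mean confidence, std of confidence, fraction of checkpoints correct).
    """
    confidence = mean(label_probs)
    variability = pstdev(label_probs)  # assumption: population standard deviation
    # Assumption: a binary prediction counts as correct when the probability of
    # the assigned label exceeds the threshold.
    correctness = mean([1.0 if p > threshold else 0.0 for p in label_probs])
    return confidence, variability, correctness


# A sample the model fits steadily, and one it fits only at later checkpoints.
steady = training_dynamics([0.9, 0.8, 0.7, 0.6])
late = training_dynamics([0.2, 0.4, 0.6, 0.8])
```

Samples with low mean confidence and low correctness under their assigned label are the kind that such plots make visible, for example a sentence treated as negative for a dialect in which it is actually acceptable.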
