Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov

TL;DR

The paper addresses Multi-Label Arabic Dialect Identification (MLADI) by showing that single-label ADI data complicates multi-label training, because many sentences treated as negative samples are in fact acceptable in several dialects. It introduces a pseudo-labeled MLADI dataset built from GPT-4o and 18 binary dialect acceptability classifiers, with aggregation guided by ALDi, and trains LahjatBERT with cardinality- and ALDi-based curriculum learning. The best LahjatBERT variant achieves a macro-F1 of around 0.69 on MLADI, compared with 0.55 for the strongest previously reported system, indicating improved generalization. The approach offers a scalable path to robust multi-dialect identification for Arabic, with potential applicability to other multi-label linguistic tasks.

Abstract

Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, but recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address this issue, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). We then train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.
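To make the reported metric concrete, here is a minimal sketch of macro-averaged F1 in a multi-label setting, where each sentence carries a set of acceptable dialects and every dialect is scored as its own binary decision. The function name, label names, and toy data are illustrative assumptions; the MLADI leaderboard's evaluation script may differ in its details.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Average the per-label binary F1 scores over all labels."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label with no gold and no predicted positives scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: three sentences, three dialect labels.
labels = ["Egypt", "Syria", "Morocco"]
gold = [{"Egypt", "Syria"}, {"Morocco"}, {"Egypt"}]
pred = [{"Egypt"}, {"Morocco"}, {"Egypt", "Syria"}]
score = macro_f1(gold, pred, labels)
```

Because every label contributes equally to the average, a dialect that the model rarely gets right pulls the score down as much as a frequent one, which is why macro F1 is sensitive to how negative samples are chosen for each dialect.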

Paper Structure (41 sections, 2 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Number of samples in each dialect after combining the NADI 2020, 2021, 2023 datasets. Samples with automatically estimated Arabic Level of Dialectness (ALDi; keleg-etal-2023-ALDi) $\leq 0.11$ are expected to be in MSA. The majority of the MSA samples are expected to be acceptable in all dialects.
  • Figure 2: The training dynamics for 6 binary acceptability classifiers, characterized by the mean confidence in the label across different steps/stages of the model training (y-axis), and the standard deviation of these confidence values (x-axis). Each pair shows the training dynamics' metrics for the non-MSA positive (left) and negative (right) samples of a single classifier, with the respective number of samples shown above each subplot. Note: the correctness ranges used to group samples are 0, ]0, 0.2[, [0.2, 0.4[, [0.4, 0.6[, [0.6, 0.8[, [0.8, 1[, and 1.
  • Figure 3: Number of samples for each label cardinality according to the three pseudo-labeling methods.
  • Figure 4: Illustration of the cardinality-based curriculum schedule, showing the progressive introduction of higher difficulty cardinality samples. The numerical values are illustrative and do not reflect the actual dataset.
  • Figure B1: Distribution of ALDi scores in the dataset. Vertical dashed lines indicate the thresholds distinguishing MSA text ($a_i < 0.11$), low--medium dialectness ($0.11$--$0.44$), medium dialectness ($0.44$--$0.77$), and highly dialectal text ($a_i > 0.77$).
  • ...and 4 more figures
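
The ALDi thresholds in the Figure B1 caption and the cardinality-based schedule in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the handling of scores that fall exactly on 0.44, the cumulative stage design, and the choice to order stages from fewer to more labels are all hypothetical, since the captions do not specify them.

```python
from typing import List, Set, Tuple

Sample = Tuple[str, Set[str]]  # (sentence, set of acceptable dialects)


def aldi_bucket(score: float) -> str:
    """Map an ALDi score to the four ranges named in the Figure B1 caption."""
    if score < 0.11:
        return "msa"
    if score < 0.44:  # assumption: 0.44 itself counts as medium
        return "low-medium"
    if score <= 0.77:
        return "medium"
    return "high"


def cardinality_stages(samples: List[Sample], caps: List[int]) -> List[List[Sample]]:
    """Build one training pool per stage.

    Stage k keeps every sample whose label count is at most caps[k], so
    later stages add higher-cardinality samples on top of earlier ones.
    Assumption: difficulty grows with the number of labels.
    """
    return [[s for s in samples if len(s[1]) <= cap] for cap in caps]


# Hypothetical toy data.
samples = [
    ("sentence a", {"Egypt"}),
    ("sentence b", {"Egypt", "Syria"}),
    ("sentence c", {"Egypt", "Syria", "Jordan"}),
]
stages = cardinality_stages(samples, caps=[1, 2, 3])
stage_sizes = [len(stage) for stage in stages]
```

Making the stages cumulative keeps earlier samples in the pool as new ones are introduced; a schedule that swaps pools entirely between stages would be an equally plausible reading of the caption.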