ADI-20: Arabic Dialect Identification dataset and models
Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares
TL;DR
ADI-20 extends the ADI-17 Arabic Dialect Identification dataset to 20 varieties (19 dialects plus MSA), enabling a systematic study of how training data size and model capacity affect dialect identification performance. The paper compares ECAPA-TDNN and Whisper-based encoders, showing that larger Whisper models and data-rich configurations perform better, with Whisper-large combined with encoder freezing and data augmentation delivering the top results. It demonstrates that competitive ADI performance can be achieved with about 53 hours of training data per dialect, and it reports zero-shot transfer results on the Casablanca dataset. By releasing the data, models, and training recipes, the work supports reproducibility and further research in ADI. Future directions include deeper exploration of ECAPA-TDNN and city-level dialect identification.
Abstract
We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers the dialects of all Arabic-speaking countries. It comprises 3,556 hours of speech from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a dense classification layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show only a small decrease in F1 score when using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work and to support further research in ADI.
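To make the classification head described above more concrete, the following is a minimal, framework-free sketch of attention pooling followed by a dense layer over 20 classes. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify the form of the attention pooling, so a single scoring vector is assumed here, and the encoder outputs are replaced by random numbers. All names (`attention_pool`, `dense`, `classify`) and the toy sizes `T` and `D` are hypothetical.

```python
import math
import random

NUM_DIALECTS = 20  # 19 dialects + MSA, per the abstract


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, attn_weights):
    """Collapse T frame vectors (each of size D) into one D-sized vector.

    Assumption: each frame is scored by a dot product with one vector of
    attention weights; the softmax of the scores weights the time average.
    """
    scores = [sum(w * x for w, x in zip(attn_weights, frame)) for frame in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas


def dense(vector, weight_rows, biases):
    """Linear layer producing one logit per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight_rows, biases)]


def classify(frames, attn_weights, weight_rows, biases):
    """Map a sequence of frame vectors to probabilities over the classes."""
    pooled, _ = attention_pool(frames, attn_weights)
    return softmax(dense(pooled, weight_rows, biases))


# Toy usage: random stand-ins for encoder outputs and untrained parameters.
rng = random.Random(0)
T, D = 50, 8  # hypothetical frame count and feature size
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
attn_weights = [rng.gauss(0, 1) for _ in range(D)]
weight_rows = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(NUM_DIALECTS)]
biases = [0.0] * NUM_DIALECTS

probs = classify(frames, attn_weights, weight_rows, biases)
predicted = max(range(NUM_DIALECTS), key=probs.__getitem__)
print(predicted, round(sum(probs), 6))
```

In a real system the frame vectors would come from the Whisper encoder and the parameters would be learned during fine-tuning. The point of the pooling step is that utterances of any length are reduced to a fixed-size vector before classification.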
