AlcLaM: Arabic Dialectal Language Model
Murtadha Ahmed, Saghir Alfasly, Bo Wen, Jamaal Qasem, Mohammed Ahmed, Yunfeng Liu
TL;DR
AlcLaM tackles the challenge of Arabic dialect diversity by building a large dialectal corpus and pretraining a BERT-based model from scratch on 13 GB of data. The authors demonstrate that dialect-aware pretraining yields strong performance across dialect identification, sentiment analysis, and hate speech detection, often surpassing multilingual baselines and baselines centered on Modern Standard Arabic (MSA). They validate this through extensive experiments on diverse Arabic datasets, with statistically significant improvements on several tasks. The work highlights the importance of dialect-rich training data for robust Arabic NLP and provides open-source access to AlcLaM and its resources, enabling broader adoption and extension in the community.
Abstract
Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often suffer from high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on Modern Standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained using only 13 GB of text, which amounts to just 7.8%, 10.2%, and 21.3% of the data used by existing models such as CAMeL, MARBERT, and ArBERT, respectively. Remarkably, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks despite the limited training data. AlcLaM is available on GitHub (https://github.com/amurtadha/Alclam) and HuggingFace (https://huggingface.co/rahbi).
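To make the data-size comparison concrete, the short sketch below inverts the percentages quoted in the abstract to estimate how much text each baseline was trained on. The resulting sizes (roughly 167 GB, 127 GB, and 61 GB) are implied by the stated ratios, not figures given in this section, and the helper name `implied_size_gb` is purely illustrative.

```python
# Assumption: the abstract's percentages mean
#   AlcLaM size / baseline size = share / 100.
# The baseline sizes computed here are implied estimates, not reported values.

ALCLAM_SIZE_GB = 13.0

# Share of each baseline's training data that AlcLaM's 13 GB represents (percent).
SHARE_PERCENT = {
    "CAMeL": 7.8,
    "MARBERT": 10.2,
    "ArBERT": 21.3,
}


def implied_size_gb(alclam_gb: float, share_percent: float) -> float:
    """Return the baseline corpus size implied by AlcLaM's size and its share."""
    return alclam_gb / (share_percent / 100.0)


implied = {
    name: implied_size_gb(ALCLAM_SIZE_GB, share)
    for name, share in SHARE_PERCENT.items()
}

for name, size in implied.items():
    print(f"{name}: about {size:.0f} GB of pretraining text (implied)")
```

Read this way, AlcLaM uses between roughly one fifth and one thirteenth of the text used by the baselines it is compared against.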
