On the importance of Data Scale in Pretraining Arabic Language Models
Abbas Ghaddar, Philippe Langlais, Mehdi Rezagholizadeh, Boxing Chen
TL;DR
The paper asks whether pretraining data scale is the key determinant of performance for Arabic pretrained language models, and argues that data size and quality matter more than architecture or model size. It retrains state-of-the-art Arabic BERT-base and T5-base variants on a large filtered Arabic corpus (512GB from 90 Common Crawl shards) to create JABERv2, JABERv2-6L, AT5Sv2, and AT5Bv2, then evaluates them on the ALUE and ORCA benchmarks. The central finding is that data scale is the dominant contributor to performance: encoder-decoder models benefit more from additional data than encoder-only ones, and gains are significant when the data is expanded by a factor of four. The work reports new state-of-the-art results on ORCA and competitive results on ALUE, and publicly releases models and code. It argues that very large, high-quality Arabic data is a prerequisite for robust Arabic NLP and future LLM development. Limitations include the lack of generation-focused benchmarks and the absence of decoder-only experiments, both of which point to future research in Arabic generative modeling.
Abstract
Pretraining monolingual language models has proven vital for performance on Arabic Natural Language Processing (NLP) tasks. In this paper, we conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs). More precisely, we reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora. We significantly improve the performance of the leading Arabic encoder-only BERT-base and encoder-decoder T5-base models on the ALUE and ORCA leaderboards, reporting state-of-the-art results in their respective model categories. In addition, our analysis strongly suggests that pretraining data is by far the primary contributor to performance, surpassing other factors. Our models and source code are publicly available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/JABER-PyTorch.
