EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora

Faisal Qarah

TL;DR

This work tackles the scarcity of dialect-specific NLP tools for Egyptian Arabic by introducing EgyBERT, a BERT-like language model pretrained on two large Egyptian dialect corpora totaling 10.4 GB (ETC, 2.5 GB, and EFC, 7.9 GB). The model is trained with a masked language modeling objective using a 75k WordPiece vocabulary and is evaluated against five multidialect Arabic language models on 10 Egyptian dialect datasets, achieving a mean accuracy of 87.33% and a mean F1-score of 84.25%. The study also contributes the ETC and EFC corpora, the largest Egyptian dialect resources reported to date, with EFC-mini publicly available to facilitate broader research. Overall, the results demonstrate the value of dialect-focused pretraining for Arabic NLP and provide publicly accessible resources to support further development of Egyptian dialect models.

Abstract

This study presents EgyBERT, an Arabic language model pretrained on 10.4 GB of Egyptian dialectal text. We evaluated EgyBERT by comparing it with five other multidialect Arabic language models across 10 evaluation datasets. EgyBERT achieved the highest average F1-score (84.25%) and accuracy (87.33%), outperforming all other models in the comparison; MARBERTv2 ranked second, with an F1-score of 83.68% and an accuracy of 87.19%. Additionally, we introduce two novel Egyptian dialectal corpora: the Egyptian Tweets Corpus (ETC), containing over 34.33 million tweets (24.89 million sentences) amounting to 2.5 GB of text, and the Egyptian Forums Corpus (EFC), comprising over 44.42 million sentences (7.9 GB of text) collected from various Egyptian online forums. Both corpora are used to pretrain the new model, and they are the largest Egyptian dialectal corpora reported in the literature to date. Furthermore, this is the first study to evaluate the performance of various language models on Egyptian dialect datasets, revealing significant differences in performance that highlight the need for more dialect-specific models. The results confirm the effectiveness of the EgyBERT model in processing and analyzing Arabic text written in the Egyptian dialect, surpassing the other language models included in the study. The EgyBERT model is publicly available at https://huggingface.co/faisalq/EgyBERT.
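
For readers unfamiliar with the masked language modeling objective mentioned in the TL;DR, the following is a minimal sketch of how training inputs are typically corrupted in BERT-style pretraining. It is not taken from the paper: the 15% selection rate and the 80/10/10 replacement split are assumptions borrowed from the original BERT recipe, and the function name `mask_tokens`, the toy vocabulary, and the placeholder tokens are hypothetical.

```python
import random

MASK = "[MASK]"  # special mask token, as used by BERT-style WordPiece tokenizers


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for masked language modeling.

    Assumes the original BERT recipe (not confirmed for EgyBERT):
    each position is selected with probability `mask_prob`; a selected
    token becomes [MASK] 80% of the time, a random vocabulary token 10%
    of the time, and stays unchanged 10% of the time.

    Returns (masked_tokens, labels), where labels[i] is the original
    token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = tok  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)
        # otherwise keep the original token in place
    return masked, labels


# Hypothetical toy data; a real setup would use WordPiece sub-tokens.
toy_vocab = ["w1", "w2", "w3", "w4", "w5"]
toy_sentence = ["w1", "w3", "w2", "w5", "w4", "w1"]
masked, labels = mask_tokens(toy_sentence, toy_vocab, mask_prob=0.5, seed=1)
```

During pretraining, the loss is computed only at positions where `labels` is not `None`, so the model learns to recover the selected tokens from their surrounding context.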

Paper Structure (16 sections, 2 figures, 3 tables, 2 algorithms)