SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora

Faisal Qarah

TL;DR

SaudiBERT is a monodialect Arabic language model pretrained exclusively on Saudi dialect text, addressing the scarcity of large Saudi-specific resources. It is trained from scratch on two novel corpora, STMC and SFC, using a masked language modeling (MLM) objective with a 75k SentencePiece vocabulary and 12 epochs of pretraining, and it achieves state-of-the-art results on most of the 11 Saudi-dialect evaluation datasets. The work demonstrates the advantage of dialect-specific pretraining for dialectal NLP and provides resources (STMC, SFC-mini, SaudiBERT) to the research community. The findings have practical implications for sentiment analysis and broader Arabic NLP applications in Saudi contexts, supporting education, business, and social media analytics.
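
As an illustration of the MLM objective mentioned above, the following is a minimal sketch of how training inputs are typically corrupted. The summary does not state the masking rate or replacement scheme used for SaudiBERT, so the 15% rate and the 80/10/10 split below are assumptions taken from the original BERT recipe (Devlin et al., 2018). The function name `mask_tokens`, the toy vocabulary, and the toy tokens are hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol; the real tokenizer's symbol may differ


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for one MLM training example.

    Assumed scheme (original BERT, not confirmed for SaudiBERT):
    each token is selected with probability mask_prob; a selected token
    is replaced by MASK 80% of the time, by a random vocabulary token
    10% of the time, and left unchanged 10% of the time.
    labels[i] holds the original token at selected positions, else None.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected: no prediction target here
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)
        # otherwise keep the original token in the input
    return inputs, labels


# Toy usage with hypothetical subword tokens.
toy_vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
toy_sentence = ["tok_a", "tok_c", "tok_b", "tok_d", "tok_a", "tok_b"]
inputs, labels = mask_tokens(toy_sentence, toy_vocab, seed=0)
```

The loss is computed only at positions where `labels` is not `None`; all other positions are ignored during training.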

Abstract

In this paper, we introduce SaudiBERT, a monodialect Arabic language model pretrained exclusively on Saudi dialectal text. To demonstrate the model's effectiveness, we compared SaudiBERT with six different multidialect Arabic language models across 11 evaluation datasets, which are divided into two groups: sentiment analysis and text classification. SaudiBERT achieved average F1-scores of 86.15% and 87.86% in these groups respectively, significantly outperforming all other comparative models. Additionally, we present two novel Saudi dialectal corpora: the Saudi Tweets Mega Corpus (STMC), which contains over 141 million tweets in Saudi dialect, and the Saudi Forums Corpus (SFC), which includes 15.2 GB of text collected from five Saudi online forums. Both corpora are used in pretraining the proposed model, and they are the largest Saudi dialectal corpora ever reported in the literature. The results confirm the effectiveness of SaudiBERT in understanding and analyzing Arabic text expressed in Saudi dialect, achieving state-of-the-art results in most tasks and surpassing other language models included in the study. The SaudiBERT model is publicly available at https://huggingface.co/faisalq/SaudiBERT.

Paper Structure

This paper contains 20 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the transformer model architecture (Vaswani et al., 2017).
  • Figure 2: An illustration of the BERT fine-tuning process for text classification (Devlin et al., 2018).