From Words to Proverbs: Evaluating LLMs Linguistic and Cultural Competence in Saudi Dialects with Absher

Renad Al-Monef, Hassan Alhuzali, Nora Alturayeif, Ashwag Alasmari

TL;DR

Absher introduces a large-scale, fine-grained benchmark to evaluate LLMs on Saudi dialects and embedded cultural knowledge across six task types, addressing gaps in dialect-aware Arabic NLP evaluation. The benchmark combines a data pipeline from Moajam, structured prompts, and dual-stage human validation to generate over 18,000 questions spanning regional dialects. Zero-shot evaluations of multiple open LLMs reveal substantial cross-dialect variability, with multilingual models often outperforming Arabic-native ones on general tasks while native models excel in proverb interpretation. The work highlights the need for dialect-rich training data, balanced regional representation, and culturally aligned evaluation to build more inclusive Arabic NLP systems.

Abstract

As large language models (LLMs) become increasingly central to Arabic NLP applications, evaluating their understanding of regional dialects and cultural nuances is essential, particularly in linguistically diverse settings like Saudi Arabia. This paper introduces Absher, a comprehensive benchmark specifically designed to assess LLMs' performance across major Saudi dialects. Absher comprises over 18,000 multiple-choice questions spanning six distinct categories: Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, and Location Recognition. These questions are derived from a curated dataset of dialectal words, phrases, and proverbs sourced from various regions of Saudi Arabia. We evaluate several state-of-the-art LLMs, including multilingual and Arabic-specific models, and provide detailed insights into their capabilities and limitations. Our results reveal notable performance gaps, particularly in tasks requiring cultural inference or contextual understanding. Our findings highlight the urgent need for dialect-aware training and culturally aligned evaluation methodologies to improve LLMs' performance in real-world Arabic applications.
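
As a rough illustration of how results on a multiple-choice benchmark of this kind can be broken down, the sketch below computes accuracy per question category. It is not the authors' evaluation code: the record keys ("type", "gold", "predicted"), the function name accuracy_by_type, and the toy records are all assumptions made for this example; only the six category names come from the abstract.

```python
from collections import defaultdict

# The six question categories named in the abstract.
QUESTION_TYPES = (
    "Meaning",
    "True/False",
    "Fill-in-the-Blank",
    "Contextual Usage",
    "Cultural Interpretation",
    "Location Recognition",
)


def accuracy_by_type(records):
    """Return {question_type: accuracy} for graded multiple-choice records.

    Each record is a dict with hypothetical keys (assumed for this sketch):
      "type"      - one of QUESTION_TYPES
      "gold"      - label of the correct option, e.g. "A"
      "predicted" - label of the option the model chose
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        qtype = rec["type"]
        if qtype not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {qtype!r}")
        total[qtype] += 1
        correct[qtype] += int(rec["predicted"] == rec["gold"])
    # Only categories that actually occur in the input are reported.
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Toy records invented purely for illustration (not benchmark data).
toy_records = [
    {"type": "Meaning", "gold": "A", "predicted": "A"},
    {"type": "Meaning", "gold": "B", "predicted": "C"},
    {"type": "Cultural Interpretation", "gold": "D", "predicted": "D"},
]
scores = accuracy_by_type(toy_records)
```

The same grouping could be keyed on region or content type (word, phrase, proverb) instead of question category, assuming each record carries that field.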

Paper Structure

This paper contains 25 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Overview of Absher, which covers all regions of Saudi Arabia (five specific regions and a general category) and presents representative examples of its three content types: words, phrases, and proverbs. The benchmark also spans all six question types: Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, and Location Recognition.
  • Figure 2: The overall pipeline for constructing the Absher benchmark.
  • Figure 3: An illustration of the different question types. The first line in each example indicates the task type, the second line presents the question, and the subsequent lines list the answer options. The correct answers are marked in green.
  • Figure 4: Model responses to dialect-based questions from different Saudi regions. Each example shows the question, its region, and the answers from six LLMs.
  • Figure 5: Average Accuracy for Different Models by Content Type
  • ...and 2 more figures