DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

Malik H. Altakrori, Nizar Habash, Abdelhakim Freihat, Younes Samih, Kirill Chirkunov, Muhammed AbuOdeh, Radu Florian, Teresa Lynn, Preslav Nakov, Alham Fikri Aji

TL;DR

DialectalArabicMMLU introduces the first large-scale, human-curated benchmark for Arabic dialect understanding by translating 3K English QA pairs into five dialects, yielding 15K dialect-specific items across 32 domains (21K+ when English and MSA are included). It evaluates 19 open-weight Arabic and multilingual LLMs under default, oracle, and dialect-identification prompts, revealing substantial, dialect-dependent gaps in QA performance compared with MSA and English. The study finds that explicit dialect conditioning does not reliably boost performance and that a model's dialect-identification ability correlates only moderately with QA success. It also shows that translating dialectal questions into English can improve QA outcomes, whereas translating them into MSA generally erodes gains, highlighting path-dependent translation effects and the need for dialect-aware training and data.

Abstract

We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.

Paper Structure

This paper contains 22 sections and 9 tables.