AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem

TL;DR

AraLingBench addresses the need for linguistically grounded evaluation of Arabic LLMs by delivering a fully human-annotated benchmark focused on five core linguistic categories. The study evaluates 35 Arabic and bilingual LLMs, revealing that models often excel at surface tasks yet struggle with grammar, morphology, and syntax, indicating a gap between knowledge-based benchmarks and true linguistic competence. By analyzing inter-category correlations, cross-benchmark alignment, and difficulty effects, AraLingBench provides a diagnostic framework to guide linguistically informed model development. The benchmark and its code are publicly available to support ongoing progress in authentic Arabic language understanding.

Abstract

We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories (grammar, morphology, spelling, reading comprehension, and syntax) through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
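As a rough illustration of how category-level scores on a multiple-choice benchmark of this kind could be computed, the sketch below tallies accuracy per linguistic category. It is not the authors' evaluation code: the field names `category` and `answer`, the option labels, and the toy items are all assumptions made for the example.

```python
from collections import defaultdict


def category_accuracy(items, predictions):
    """Return accuracy per category for multiple-choice items.

    `items` is a list of dicts with hypothetical keys 'category' and
    'answer'; `predictions` holds the option label chosen by a model
    for each item, in the same order.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["category"]] += 1
        if pred == item["answer"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Hypothetical toy data, not taken from the benchmark.
items = [
    {"category": "grammar", "answer": "A"},
    {"category": "grammar", "answer": "C"},
    {"category": "syntax", "answer": "B"},
    {"category": "spelling", "answer": "D"},
]
predictions = ["A", "B", "B", "D"]
scores = category_accuracy(items, predictions)
```

Reporting one score per category, rather than a single aggregate, is what lets a benchmark like this separate surface proficiency from weaker grammatical or syntactic performance.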

Paper Structure

This paper contains 17 sections, 9 figures, and 4 tables.

Figures (9)

  • Figure 1: Sample Questions from AraLingBench. Example items illustrating the five linguistic categories: grammar, morphology, spelling, reading comprehension, and syntax. Each question targets a distinct aspect of Arabic linguistic competence and is crafted by expert annotators to assess genuine linguistic understanding.
  • Figure 2: Overview of AraLingBench. Category balance, difficulty distribution, question formats, and answer position frequencies. The benchmark maintains balanced coverage across linguistic categories and difficulty levels.
  • Figure 3: Category-level accuracy distribution. Models perform best on Spelling and Reading Comprehension, with Syntax remaining the most difficult category.
  • Figure 4: Inter-category correlations. Grammar and Morphology show the strongest relationship ($r = 0.80$), while Syntax remains comparatively independent, suggesting distinct representational mechanisms.
  • Figure 5: Cross-benchmark correlations. Pearson coefficients between AraLingBench and seven major Arabic benchmarks reveal strong alignment with language understanding tasks but weak or negative correlation with retrieval-augmented systems.
  • ...and 4 more figures
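
The inter-category and cross-benchmark correlations in Figures 4 and 5 are Pearson coefficients. A minimal sketch of that computation follows, assuming each model contributes one accuracy per category; the accuracy values are invented for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model accuracies (one entry per model), for illustration only.
grammar_acc = [0.40, 0.55, 0.62, 0.70]
morphology_acc = [0.35, 0.50, 0.66, 0.72]
r = pearson(grammar_acc, morphology_acc)
```

Applying `pearson` to every pair of category score vectors would yield a correlation matrix of the kind Figure 4 summarizes; pairing AraLingBench scores with scores from another benchmark would give the cross-benchmark view of Figure 5.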