IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Ali Abdelaal; Mohammed Nader Al Haffar; Mahmoud Fawzi; Walid Magdy

Abstract

Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track comprises multiple question types that probe how well LLMs handle different aspects of Islamic knowledge. The benchmark underpins the public IslamicMMLU leaderboard. We initially evaluate 26 LLMs, whose accuracy averaged across the three tracks ranges from 39.8% to 93.8% (the latter achieved by Gemini 3 Flash). The Quran track shows the widest span (32.4% to 99.3%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task, which reveals that school-of-thought preferences vary across models. Arabic-specific models show mixed results, but all of them underperform frontier models. The evaluation code and leaderboard are publicly available.
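The headline figures in the abstract (track sizes and accuracy averaged across tracks) can be illustrated with a minimal sketch. The abstract does not state whether the cross-track average is weighted by track size, so the unweighted mean below is an assumption. The function names and the toy values are hypothetical and are not taken from the released evaluation code.

```python
from statistics import mean

# Track sizes as reported in the abstract.
TRACK_SIZES = {"quran": 2013, "hadith": 4000, "fiqh": 4000}


def track_accuracy(predictions, answers):
    """Fraction of questions whose predicted option equals the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("a track must contain at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def macro_average(per_track_accuracy):
    """Unweighted mean over tracks.

    Assumption: every track counts equally, regardless of its size.
    """
    return mean(per_track_accuracy.values())


if __name__ == "__main__":
    # Hypothetical toy predictions, for illustration only.
    toy_runs = {
        "quran": (["A", "B", "C", "D"], ["A", "B", "C", "A"]),
        "hadith": (["B", "B"], ["B", "C"]),
        "fiqh": (["D", "A", "C"], ["D", "A", "C"]),
    }
    per_track = {
        name: track_accuracy(preds, gold)
        for name, (preds, gold) in toy_runs.items()
    }
    print(per_track)
    print(macro_average(per_track))
```

A size-weighted average would instead weight each track by its entry in `TRACK_SIZES`. With 2,013 Quran questions against 4,000 in each of the other tracks, the two conventions can give noticeably different headline numbers.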

Paper Structure (54 sections, 2 figures, 5 tables)

Introduction
Background and Related Work
Islamic Knowledge Domains
MMLU and LLM Evaluation
Arabic NLP Benchmarks
Cultural Bias in LLMs
Islamic and Religious NLP
Data Preparation Methodology
Quran Track
Hadith Track
Preprocessing.
Question Types.
Fiqh Track
Quality Verification.
Fiqh Question Design and Bias Methodology
...and 39 more sections

Figures (2)

Figure 1: Fiqh track pipeline architectures. (a) Extraction of structured rulings from al-Jaziri's source text. (b) Generation and validation of benchmark questions from the structured corpus.
Figure 2: Model evaluation results. (a) Per-track accuracy with cross-track spread. (b) Madhab bias: dots near the centre line indicate balanced school selection.
