IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions

Ezieddin Elmahjub, Junaid Qadir, Abdullah Mushtaq, Rafay Naeem, Ibrahim Ghaznavi, Waleed Iqbal

TL;DR

IslamicLegalBench offers the first systematic framework for evaluating Islamic legal reasoning in AI. It reveals critical gaps in tools increasingly relied on for spiritual guidance and shows that prompt-based methods cannot compensate for missing foundational knowledge.

Abstract

As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results highlight that prompt-based methods cannot compensate for missing foundational knowledge. IslamicLegalBench offers the first systematic framework to evaluate Islamic legal reasoning in AI, revealing critical gaps in tools increasingly relied on for spiritual guidance.

Paper Structure

This paper contains 65 sections, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: Comprehensive workflow of dataset creation from source Islamic legal texts to final benchmarking dataset, showing the transformation from 271 expert-curated source entries to 718 structured evaluation items under continued expert supervision.
  • Figure 2: Taxonomy of benchmark dataset tasks (T1 through T13) organized by complexity level.
  • Figure 3: Illustration of a false premise query evaluated through an LLM-as-a-judge framework. The LLM output (generated by LLaMA-4 Maverick) falsely assumes that Abū Ḥanīfa required eight conditions for a valid Salām contract, while the correct jurisprudential reference specifies only six. The output accepts the false premise and hallucinates extra conditions, whereas the judge model correctly identifies the error and assigns a score of 0 (Incorrect). In the 'LLM Output' and 'Reference Answer' panels, green highlights correctly mentioned conditions, red indicates fabricated or hallucinated ones, and blue marks essential required conditions that are missing.
  • Figure 4: Overall performance comparison of evaluated LLMs under few-shot prompting. Models are ordered by category (closed-source followed by open-source). The stacked bars decompose responses into correct (green), partially correct (orange), and incorrect (red) categories, while separate purple bars indicate hallucination rates. Performance ranges from 30.96% (Llama 3.1 8B) to 67.65% (GPT-5), revealing substantial variation in Islamic law reasoning capabilities.
  • Figure 5: Hallucination Severity Analysis. Few-shot prompting reveals substantial variation in hallucination robustness: multiple models exceed the 25% moderate-risk threshold, and several reach critical hallucination levels above 40%, particularly on higher-complexity tasks. Panel (a) shows severity by task complexity, while Panel (b) summarizes overall rankings.
  • ...and 6 more figures
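
The captions for Figures 4 and 5 describe responses split into correct, partially correct, and incorrect shares, a separate hallucination rate, and two severity thresholds (25% and 40%). The following is a minimal sketch of how such a summary could be computed from per-item judge verdicts. It is not the authors' evaluation code: the verdict strings, the `summarize` and `severity` helpers, the `(verdict, hallucinated)` input layout, and the "low" band name are all assumptions made for illustration. Only the two threshold values come from the Figure 5 caption.

```python
from collections import Counter

# Thresholds taken from the Figure 5 caption. The "low" band name and the
# handling of values exactly at a threshold are assumptions of this sketch.
MODERATE_THRESHOLD = 25.0
CRITICAL_THRESHOLD = 40.0


def summarize(judgements):
    """Aggregate per-item judge verdicts into percentage rates.

    `judgements` is a list of (verdict, hallucinated) pairs. The verdict
    strings "correct", "partial", and "incorrect" are a hypothetical
    encoding, as is the boolean hallucination flag.
    """
    total = len(judgements)
    if total == 0:
        raise ValueError("no judgements to summarize")
    counts = Counter(verdict for verdict, _ in judgements)
    hallucinated = sum(1 for _, flag in judgements if flag)
    return {
        "correct": 100.0 * counts["correct"] / total,
        "partial": 100.0 * counts["partial"] / total,
        "incorrect": 100.0 * counts["incorrect"] / total,
        "hallucination": 100.0 * hallucinated / total,
    }


def severity(hallucination_rate):
    """Map a hallucination rate (in percent) to a severity band."""
    if hallucination_rate > CRITICAL_THRESHOLD:
        return "critical"
    if hallucination_rate > MODERATE_THRESHOLD:
        return "moderate"
    return "low"
```

Calling `summarize` on a model's verdicts and passing the resulting `"hallucination"` value to `severity` gives one band per model, which is roughly the kind of ranking Panel (b) of Figure 5 is described as summarizing.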