Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir

TL;DR

This paper asks whether LLMs can generate Islamic content that is theologically accurate and properly cited. The authors introduce a dual-agent evaluation framework: a quantitative agent that verifies citations and scores responses across six dimensions, and a qualitative agent that compares responses side by side on tone, structure, depth, and comparative framing. The framework is applied to GPT-4o, Ansari AI, and Fanar on 50 prompts derived from authentic Islamic blogs. GPT-4o achieves the highest Islamic accuracy and citation scores, with Ansari AI close behind; Fanar lags but introduces domain-specific innovations. All three models still fall short on reliable citations and doctrinal grounding, which underscores the need for community-driven benchmarks in faith-sensitive domains. The study offers a blueprint for interpretable, auditable AI evaluation that can be extended to other high-stakes fields such as medicine, law, and journalism, with the aim of safer and more accountable AI-assisted guidance for diverse communities.

Abstract

Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.

Paper Structure

This paper contains 17 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Illustration of System Design and Methodology of the proposed Dual-Agent framework for LLM-generated Islamic content verification, both quantitatively and qualitatively.
  • Figure 2: Quantitative comparison of ChatGPT, Ansari AI, and Fanar across six evaluation dimensions. ChatGPT leads in Style & Structure and in Islamic Content, Ansari AI follows closely in all dimensions, and Fanar shows lower scores and higher variability.
  • Figure 3: Performance of the LLM chatbots by dimension in the qualitative analysis, created from the verdict table produced by the qualitative agent. A positive value indicates 'Best' among the three chatbots on the same prompts, and a negative value indicates 'Worst' among the three.
  • Figure 4: Agent-based citation verification analysis for a Fanar-generated response. The system traces each Qur'anic reference, evaluates its textual and contextual accuracy, detects citation hallucinations, and provides evidence-backed justifications. This mapped example illustrates how the framework connects model outputs to reference-level verifications, facilitating an explainable assessment of citation integrity.
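
The signed values described for Figure 3 can be illustrated with a small sketch. It assumes each verdict names one 'Best' and one 'Worst' chatbot per prompt and dimension, and that a chatbot's value is its 'Best' count minus its 'Worst' count. The record layout, the `net_scores` helper, and the sample verdicts are hypothetical and do not come from the paper, which may aggregate its verdict table differently.

```python
from collections import defaultdict

# Model names as used in the abstract; the verdict format below is an assumption.
MODELS = ("GPT-4o", "Ansari AI", "Fanar")


def net_scores(verdicts):
    """Return {dimension: {model: best_count - worst_count}}.

    Each verdict is a dict with keys "dimension", "best", and "worst".
    A model gains +1 when judged 'Best' and loses 1 when judged 'Worst'.
    """
    scores = defaultdict(lambda: {m: 0 for m in MODELS})
    for v in verdicts:
        row = scores[v["dimension"]]
        row[v["best"]] += 1
        row[v["worst"]] -= 1
    return {dim: dict(row) for dim, row in scores.items()}


# Made-up verdicts for illustration only; these are not results from the paper.
example = [
    {"prompt": 1, "dimension": "Tone", "best": "Ansari AI", "worst": "Fanar"},
    {"prompt": 2, "dimension": "Tone", "best": "GPT-4o", "worst": "Fanar"},
    {"prompt": 1, "dimension": "Depth", "best": "GPT-4o", "worst": "Ansari AI"},
]

result = net_scores(example)
for dimension, row in result.items():
    print(dimension, row)
```

Under this reading, a chatbot that is never judged 'Best' or 'Worst' on a dimension stays at zero, so the sign of each value shows whether it was more often the strongest or the weakest of the three.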