MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

Abdessalam Bouchekif; Shahd Gaben; Samer Rashwani; Somaya Eltanbouly; Mutaz Al-Khatib; Heba Sbahi; Mohammed Ghaly; Emad Mohamed

TL;DR

This work introduces MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases for training and evaluating models on the full reasoning chain: intermediate legal decisions, justifications based on classical juristic sources and established inheritance rules, and exact share calculations.

Abstract

Islamic inheritance law ('ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases to train and evaluate the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (hajb) and allocation rules, and (iii) computing exact inheritance shares. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions, including intermediate legal decisions and justifications based on classical juristic sources and established inheritance rules, as well as exact share calculations. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate five LLMs in a zero-shot setting. Gemini-2.5-flash achieves about 90% MIR-E on both validation and test, while Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at https://github.com/bouchekif/inheritance_evaluation.
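
To make the 'awl rule mentioned in the abstract concrete, the following is a minimal sketch of the proportional reduction applied when the fixed shares sum to more than the whole estate. It is an illustration only, not the dataset's solver or the MIR-E scorer: the function name `apply_awl` and the dictionary layout are assumptions introduced here, and the example case (a husband with two full sisters) is a standard textbook one rather than a case drawn from MAWARITH.

```python
from fractions import Fraction


def apply_awl(fixed_shares):
    """Scale fixed shares down proportionally when they exceed the estate.

    `fixed_shares` maps a heir label to its prescribed fraction.
    Hypothetical helper for illustration; not part of the MAWARITH release.
    """
    total = sum(fixed_shares.values(), Fraction(0))
    if total <= 1:
        # No over-subscription, so 'awl does not apply.
        return dict(fixed_shares)
    # Dividing by the total keeps the ratios between heirs unchanged
    # while making the shares sum to exactly 1.
    return {heir: share / total for heir, share in fixed_shares.items()}


# Husband takes 1/2 and two full sisters take 2/3 together: 7/6 in total.
shares = {"husband": Fraction(1, 2), "two_full_sisters": Fraction(2, 3)}
adjusted = apply_awl(shares)
```

Exact rational arithmetic is used so that the adjusted shares can be compared against a reference solution without floating-point tolerance. In this case the base is raised from 6 to 7, giving the husband 3/7 and the sisters 4/7.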

Paper Structure (25 sections, 10 equations, 4 figures, 6 tables)

Introduction
Related Work
Background: Islamic Inheritance Law
Data Description
Data Collection
Reasoning Representation
Dataset Overview
Evaluation Metric: MIR-E
Notation.
Heirs and Blocking Identification.
Share Assignment.
Adjustment
Overall MIR-E Score.
Experiments and Results
Experimental Setup
...and 10 more sections

Figures (4)

Figure 1: Distribution of cases by number of distinct heir categories in the training and test splits. Heir categories represent distinct kinship types (e.g., father, mother, siblings). For dataset analysis, multiple individuals within the same kinship group are grouped into a single category. The training and test sets share the same distributional profile, differing only in scale.
Figure 2: Cumulative pipeline success rates across stages (Step1--Step4) for each model.
Figure 3: Blocking errors by genealogical heir level across models. For each model, the left bar shows false blocking (FB: eligible heirs wrongly blocked) and the right bar shows false eligibility (FE: heirs wrongly added as eligible). Colors indicate heir levels (1--8), as defined in the appendix on heirs.
Figure 4: Frequency distribution of all heir types in the corpus, ordered by kinship proximity.
