Can Large Language Models Predict the Outcome of Judicial Decisions?

Mohamed Bayan Kmainasi; Ali Ezzat Shahroor; Amani Al-Ghraibah

TL;DR

The paper tackles Arabic Legal Judgment Prediction (LJP), a domain hampered by data scarcity and linguistic complexity, by constructing a Saudi commercial court-based LJP dataset and benchmarking open-source LLMs. It systematically compares zero-shot, one-shot, and LoRA-based fine-tuning (on LLaMA-3 models) using a mixed quantitative (BLEU, ROUGE, BERTScore) and qualitative (eight-dimension rubric) evaluation framework. Key findings show that LoRA-finetuned smaller models can match or closely approach the performance of larger models while offering substantial resource efficiency, and that instruction generalization improves with diverse prompts. The work provides public datasets, code, and trained models, delivering a practical, scalable path for Arabic legal NLP and offering a framework for future cross-domain and multilingual legal LLM research.

Abstract

Large Language Models (LLMs) have shown exceptional capabilities in Natural Language Processing (NLP) across diverse domains. However, their application to specialized tasks such as Legal Judgment Prediction (LJP) in low-resource languages like Arabic remains underexplored. In this work, we address this gap by developing an Arabic LJP dataset, collected and preprocessed from Saudi commercial court judgments. We benchmark state-of-the-art open-source LLMs, including LLaMA-3.2-3B and LLaMA-3.1-8B, under zero-shot, one-shot, and LoRA fine-tuning configurations. We also employ an evaluation framework that combines quantitative metrics (BLEU, ROUGE, and BERTScore) with qualitative assessments, scored by an LLM, along dimensions such as Coherence, Legal Language, and Clarity. Our results demonstrate that fine-tuned smaller models achieve performance comparable to larger models in task-specific contexts while offering significant resource efficiency. Furthermore, we investigate the impact of fine-tuning the model on a diverse set of instructions, offering insights into the development of more human-centric and adaptable LLMs. We have made the dataset, code, and models publicly available to provide a solid foundation for future research in Arabic legal NLP.
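
To make the quantitative side of the evaluation concrete, the following is a minimal sketch of a ROUGE-1 style unigram-overlap F1 score between a reference judgment and a model output. It is an illustration only, not the authors' evaluation code: the function name `rouge1_f1` is hypothetical, tokenization by whitespace is an assumption, and real Arabic text would likely need normalization (diacritics, letter variants) before comparison.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference text and a candidate text.

    Assumption: tokens are obtained by whitespace splitting, which is a
    simplification and not necessarily what the paper's pipeline does.
    """
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it
    # appears in both texts (multiset intersection).
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)  # share of candidate tokens matched
    recall = overlap / len(ref_tokens)      # share of reference tokens matched
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical placeholder strings, not taken from the dataset.
    reference = "the court orders the defendant to pay the claimed amount"
    candidate = "the court orders the defendant to pay"
    print(round(rouge1_f1(reference, candidate), 3))
```

Overlap scores of this kind reward shared wording rather than legal correctness, which is one reason to pair them with embedding-based scores such as BERTScore and with rubric-based qualitative assessment, as the abstract describes.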
