Assessing the Reliability of Large Language Models in the Bengali Legal Context: A Comparative Evaluation Using LLM-as-Judge and Legal Experts

Sabik Aftahee; A. F. M. Farhad; Arpita Mallik; Ratnajit Dhar; Jawadul Karim; Nahiyan Bin Noor; Ishmam Ahmed Solaiman

TL;DR

This study tackles the reliability of Bengali legal AI by evaluating four state-of-the-art LLMs on authentic Bangladeshi legal questions using a dual framework: an LLM-as-Judge and professional lawyers. It combines automated metrics with multi-dimensional expert assessments to gauge factual accuracy, legal safety, completeness, and clarity, revealing both strengths and dangerous gaps such as hallucinations. The results show AI can deliver high-quality, organized legal guidance but requires rigorous expert oversight and robust safeguards before deployment in public legal consultation. The work underscores the value of a hybrid evaluation approach and highlights practical paths, including domain-specific fine-tuning and retrieval-augmented mechanisms, to responsibly integrate AI into Bangladesh’s legal aid landscape.

Abstract

Accessing legal help in Bangladesh is hard. People face high fees, complex legal language, a shortage of lawyers, and millions of unresolved court cases. Generative AI models like OpenAI GPT-4.1 Mini, Gemini 2.0 Flash, Meta Llama 3 70B, and DeepSeek R1 could potentially democratize legal assistance by providing quick and affordable legal advice. In this study, we collected 250 authentic legal questions from the Facebook group "Know Your Rights," where verified legal experts regularly provide authoritative answers. These questions were subsequently submitted to four advanced AI models, and responses were generated using a consistent, standardized prompt. A comprehensive dual evaluation framework was employed, in which a state-of-the-art LLM served as a judge, assessing each AI-generated response across four critical dimensions: factual accuracy, legal appropriateness, completeness, and clarity. Following this, the same set of questions was evaluated by three licensed Bangladeshi legal professionals according to the same criteria. In addition, automated evaluation metrics, including BLEU scores, were applied to assess response similarity. Our findings reveal a complex landscape where AI models frequently generate high-quality, well-structured legal responses but also produce dangerous misinformation, including fabricated case citations, incorrect legal procedures, and potentially harmful advice. These results underscore the critical need for rigorous expert validation and comprehensive safeguards before AI systems can be safely deployed for legal consultation in Bangladesh.
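
As an illustration of the automated similarity metric mentioned above, the following is a minimal sketch of a sentence-level BLEU score between an expert answer and a model response. The abstract does not say which BLEU variant, tokenizer, or smoothing was used, so the whitespace tokenization, the add-one smoothing, and the function names `ngrams` and `sentence_bleu` are assumptions made here for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with add-one smoothing on every n-gram precision.

    Assumption: whitespace tokenization. Bengali text would likely need a
    language-aware tokenizer in practice.
    """
    ref = reference.split()
    cand = candidate.split()
    if not ref or not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps the score defined when an order has no matches.
        log_precision_sum += math.log((matches + 1) / (total + 1))

    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    # Placeholder strings, not taken from the study's data.
    expert = "the applicant should file the form with the local office"
    model = "the applicant should submit the form to the local office"
    print(round(sentence_bleu(expert, model), 3))
```

A score near 1 indicates heavy n-gram overlap with the expert answer, and a score near 0 indicates little overlap. Overlap alone does not establish legal correctness, which is consistent with the study pairing BLEU with LLM-as-Judge and expert review.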
