LegalRAG: A Hybrid RAG System for Multilingual Legal Information Retrieval

Muhammad Rafsan Kabir, Rafeed Mohammad Sultan, Fuad Rahman, Mohammad Ruhul Amin, Sifat Momen, Nabeel Mohammed, Shafin Rahman

TL;DR

This work tackles the problem of extracting precise legal information from bilingual Bangla–English government documents by developing a specialized multilingual Retrieval-Augmented Generation (RAG) question-answering framework for the Bangladesh Police Gazettes. It contrasts a Vanilla RAG pipeline with an Advanced RAG pipeline that adds a relevance-checking and query-refinement stage, and it uses multiple LLMs for generation. Experiments on a diverse dataset of 168 QA pairs show that Advanced RAG consistently improves retrieval accuracy and answer quality over Vanilla RAG, particularly on multilingual and unstructured gazette content. The study demonstrates the feasibility and impact of applying advanced RAG techniques to low-resource legal documents, with implications for accessible, AI-assisted legal information retrieval in regulatory domains.

Abstract

Natural Language Processing (NLP) and computational linguistics techniques are increasingly being applied across various domains, yet their use in legal and regulatory tasks remains limited. To address this gap, we develop an efficient bilingual question-answering framework for regulatory documents, specifically the Bangladesh Police Gazettes, which contain both English and Bangla text. Our approach employs modern Retrieval-Augmented Generation (RAG) pipelines to enhance information retrieval and response generation. In addition to conventional RAG pipelines, we propose an advanced RAG-based approach that improves retrieval performance, leading to more precise answers. This system enables efficient searching for specific government legal notices, making legal information more accessible. We evaluate both our proposed and conventional RAG systems on a diverse test set drawn from the Bangladesh Police Gazettes, demonstrating that our approach consistently outperforms existing methods across all evaluation metrics.

Paper Structure

This paper contains 13 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: A sample response generated by the conventional RAG pipeline (Vanilla RAG) and our proposed pipeline (Advanced RAG) for a given user query. Our proposed advanced RAG pipeline improves the response by retrieving an additional relevant text chunk for the LLM while eliminating an irrelevant one retrieved by the Vanilla RAG pipeline. The orange symbols indicate the English translation of the Bangla texts.
  • Figure 2: (A) Distribution of question-answer pair domains in the curated evaluation dataset. (B) Language distribution in the bilingual regulatory document (Bangladesh Police Gazettes).
  • Figure 3: Proposed RAG pipeline for multilingual legal document question-answering. The yellow box highlights the Relevance Check and Query Refinement processes, which are introduced in our proposed novel framework to enhance the existing vanilla RAG pipeline.
  • Figure 4: Interpretation of human evaluation scores (1–5).
  • Figure 5: The impact of (a) sampling temperature and (b) prompt language on the responses generated by the Advanced RAG.
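
The Relevance Check and Query Refinement stages named in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the word-overlap retriever, the overlap-based relevance test, the `refine` helper, and the toy English chunks are all hypothetical stand-ins, chosen only so the example runs without an embedding model or an LLM. It shows one plausible control flow in which retrieved chunks that fail the relevance check are dropped and the query is refined to retrieve a replacement, which is the kind of behaviour Figure 1 describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Chunk:
    text: str


def tokenize(text: str) -> Set[str]:
    # Hypothetical tokenizer: lowercase whitespace split.
    return set(text.lower().split())


def retrieve(query: str, chunks: List[Chunk], k: int = 2) -> List[Chunk]:
    # Toy retriever (assumption): rank chunks by word overlap with the query.
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c.text)), reverse=True)
    return ranked[:k]


def is_relevant(query: str, chunk: Chunk, min_overlap: int = 2) -> bool:
    # Stand-in relevance check (assumption): a real system might ask an LLM.
    return len(tokenize(query) & tokenize(chunk.text)) >= min_overlap


def advanced_retrieve(
    query: str,
    chunks: List[Chunk],
    refine: Callable[[str], str],
    k: int = 2,
    max_rounds: int = 1,
) -> List[Chunk]:
    # Keep only chunks that pass the relevance check; if fewer than k
    # survive, refine the query and retrieve again.
    kept: List[Chunk] = []
    current = query
    for _ in range(max_rounds + 1):
        for chunk in retrieve(current, chunks, k):
            if is_relevant(current, chunk) and chunk not in kept:
                kept.append(chunk)
        if len(kept) >= k:
            break
        current = refine(current)
    return kept[:k]


def refine(query: str) -> str:
    # Hypothetical refinement: append domain terms to the query.
    return query + " gazette notice"


# Toy corpus (invented for illustration, not taken from the gazettes).
CHUNKS = [
    Chunk("police gazette notice on officer promotion rules"),
    Chunk("weather report for dhaka city"),
    Chunk("gazette notice on promotion of sub-inspector rank"),
]
QUERY = "promotion rules for officers"

vanilla = retrieve(QUERY, CHUNKS)
advanced = advanced_retrieve(QUERY, CHUNKS, refine)
```

In this toy run, the plain retriever returns the unrelated weather chunk because it shares the word "for" with the query, while the relevance-checked loop discards it and, after one refinement, retrieves the second promotion notice instead.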