LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain

Nicholas Pipitone; Ghita Houir Alami

TL;DR

LegalBench-RAG is the first benchmark focused on the retrieval step of legal RAG systems, enabling precise evaluation of snippet-level text retrieval within large legal corpora. The dataset is built by tracing LegalBench annotations back to their exact source spans across four legal corpora, yielding 6,889 QA pairs (plus a lighter LegalBench-RAG-mini with 776 queries) and emphasizing exact-span retrieval over whole-document retrieval. In the authors' experiments, the Recursive Character Text Splitter (RCTS) without a reranker delivers the best retrieval performance, while general-purpose rerankers may underperform on legal text, which points to a need for domain-specific tooling. The dataset and findings give industry and researchers a practical platform to compare retrieval approaches, improve RAG accuracy in the legal domain, and develop specialized legal retrieval components.

Abstract

Retrieval-Augmented Generation (RAG) systems show promise and are becoming increasingly relevant in AI-powered legal applications. Existing benchmarks, such as LegalBench, assess the generative capabilities of Large Language Models (LLMs) in the legal domain, but there is a critical gap in evaluating the retrieval component of RAG systems. To address this, we introduce LegalBench-RAG, the first benchmark specifically designed to evaluate the retrieval step of RAG pipelines within the legal space. LegalBench-RAG emphasizes precise retrieval by focusing on extracting minimal, highly relevant text segments from legal documents. These snippets are preferred over retrieving document IDs or large sequences of imprecise chunks, both of which can exceed context window limitations. Long context windows cost more to process, induce higher latency, and lead LLMs to forget or hallucinate information. Additionally, precise results allow LLMs to generate citations for the end user. The LegalBench-RAG benchmark is constructed by retracing the context used in LegalBench queries back to its original location within the legal corpus, resulting in a dataset of 6,858 query-answer pairs over a corpus of over 79M characters, entirely human-annotated by legal experts. We also introduce LegalBench-RAG-mini, a lightweight version for rapid iteration and experimentation. By providing a dedicated benchmark for legal retrieval, LegalBench-RAG serves as a critical tool for companies and researchers focused on enhancing the accuracy and performance of RAG systems in the legal domain. The LegalBench-RAG dataset is publicly available at https://github.com/zeroentropy-cc/legalbenchrag.
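To make the idea of snippet-level retrieval concrete, the following is a minimal sketch of how retrieved text could be scored against annotated source spans. It assumes that both the retrieved snippets and the ground-truth annotations are given as half-open character ranges within a single document, and that precision and recall are measured by character overlap. The function names (`merge_spans`, `overlap_chars`, `span_precision_recall`) and the span representation are illustrative assumptions, not the benchmark's actual code or data format.

```python
from typing import List, Tuple

# Assumption: a span is a half-open character range [start, end) in one document.
Span = Tuple[int, int]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping or touching spans so no character is counted twice."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if start >= end:
            continue  # ignore empty or inverted spans
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_chars(a: List[Span], b: List[Span]) -> int:
    """Count characters covered by both span lists (both must already be merged)."""
    total = 0
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def span_precision_recall(
    retrieved: List[Span], relevant: List[Span]
) -> Tuple[float, float]:
    """Character-level precision and recall of retrieved spans (hypothetical metric)."""
    retrieved = merge_spans(retrieved)
    relevant = merge_spans(relevant)
    hit = overlap_chars(retrieved, relevant)
    n_retrieved = sum(end - start for start, end in retrieved)
    n_relevant = sum(end - start for start, end in relevant)
    precision = hit / n_retrieved if n_retrieved else 0.0
    recall = hit / n_relevant if n_relevant else 0.0
    return precision, recall
```

Under this scoring, returning a whole document would reach full recall but very low precision, while a tight snippet that covers the annotated span scores well on both, which is the behavior the benchmark's emphasis on minimal, highly relevant segments is meant to reward.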
