LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain
Nicholas Pipitone, Ghita Houir Alami
TL;DR
LegalBench-RAG is the first benchmark focused on the retrieval step of legal RAG systems, enabling precise evaluation of snippet-level text retrieval within large legal corpora. The dataset is built by tracing LegalBench annotations back to their exact source spans across four legal corpora, yielding 6,889 QA pairs (plus a lighter LegalBench-RAG-mini with 776 queries), and it emphasizes exact-span retrieval over whole-document retrieval. In the authors' experiments, the Recursive Text Character Splitter (RTCS) without a reranker delivers the best retrieval performance, while general-purpose rerankers may underperform on legal text, which points to a need for domain-specific tooling. The dataset and findings give industry practitioners and researchers a common platform for comparing retrieval approaches, improving RAG accuracy in the legal domain, and developing specialized legal retrieval components.
Abstract
Retrieval-Augmented Generation (RAG) systems show promising potential and are becoming increasingly relevant in AI-powered legal applications. Existing benchmarks, such as LegalBench, assess the generative capabilities of Large Language Models (LLMs) in the legal domain, but there is a critical gap in evaluating the retrieval component of RAG systems. To address this, we introduce LegalBench-RAG, the first benchmark specifically designed to evaluate the retrieval step of RAG pipelines within the legal space. LegalBench-RAG emphasizes precise retrieval by focusing on extracting minimal, highly relevant text segments from legal documents. Such snippets are preferred over document IDs or long sequences of imprecise chunks, both of which can exceed context window limits. Long contexts cost more to process, induce higher latency, and lead LLMs to forget or hallucinate information. Additionally, precise results allow LLMs to generate citations for the end user. The LegalBench-RAG benchmark is constructed by retracing the context used in LegalBench queries back to its original location within the legal corpus, resulting in a dataset of 6,858 query-answer pairs over a corpus of more than 79M characters, entirely human-annotated by legal experts. We also introduce LegalBench-RAG-mini, a lightweight version for rapid iteration and experimentation. By providing a dedicated benchmark for legal retrieval, LegalBench-RAG serves as a critical tool for companies and researchers focused on enhancing the accuracy and performance of RAG systems in the legal domain. The LegalBench-RAG dataset is publicly available at https://github.com/zeroentropy-cc/legalbenchrag.
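To make the idea of snippet-level retrieval concrete, the following is a minimal sketch of how retrieved text could be scored against annotated ground-truth spans at the character level. It is an illustration under stated assumptions, not the benchmark's own evaluation code: the `Span` layout (file path, start offset, exclusive end offset), the function names, and the file names in the example are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical span layout: (file_path, start_char, end_char), end exclusive.
Span = Tuple[str, int, int]
Intervals = Dict[str, List[Tuple[int, int]]]


def merge_spans(spans: List[Span]) -> Intervals:
    """Group spans by file and merge overlaps so no character is counted twice."""
    by_file: Intervals = {}
    for path, start, end in spans:
        if end > start:  # ignore empty or inverted spans
            by_file.setdefault(path, []).append((start, end))
    merged: Intervals = {}
    for path, intervals in by_file.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[path] = out
    return merged


def total_length(merged: Intervals) -> int:
    """Total number of characters covered by the merged intervals."""
    return sum(end - start for ivs in merged.values() for start, end in ivs)


def overlap_length(a: Intervals, b: Intervals) -> int:
    """Number of characters covered by both interval sets."""
    total = 0
    for path, a_intervals in a.items():
        for a_start, a_end in a_intervals:
            for b_start, b_end in b.get(path, []):
                total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


def span_precision_recall(
    retrieved: List[Span], relevant: List[Span]
) -> Tuple[float, float]:
    """Character-level precision and recall of retrieved spans."""
    r, g = merge_spans(retrieved), merge_spans(relevant)
    overlap = overlap_length(r, g)
    r_len, g_len = total_length(r), total_length(g)
    precision = overlap / r_len if r_len else 0.0
    recall = overlap / g_len if g_len else 0.0
    return precision, recall


# One annotated span, and two retrieved chunks (one of them from the wrong file).
relevant = [("nda.txt", 100, 200)]
retrieved = [("nda.txt", 150, 250), ("other.txt", 0, 100)]
print(span_precision_recall(retrieved, relevant))  # (0.25, 0.5)
```

In this example, half of the annotated span is recovered (recall 0.5), but only 50 of the 200 retrieved characters are relevant (precision 0.25). A whole-document retriever would score perfect recall here while its precision collapsed, which is the behavior a snippet-level benchmark is meant to expose.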
