LRAGE: Legal Retrieval Augmented Generation Evaluation Tool
Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang
TL;DR
The paper addresses the challenge of evaluating retrieval-augmented generation (RAG) in the legal domain. It introduces LRAGE, an open-source platform that extends the Language Model Evaluation Harness with Pyserini-based retrievers and a rubric-based LLM-as-a-Judge, enabling holistic cross-component analysis. It provides pre-configured legal benchmarks (LegalBench, KBL, LawBench) and corpora (Pile-of-Law), accessible via GUI and CLI for immediate experimentation. Experiments across multilingual benchmarks show that overall RAG performance depends on the corpus, retriever, reranker, LLM backbone, and rubric design, underscoring both the need for domain-specific evaluation tooling and the practical utility of LRAGE for safer legal AI deployment.
Abstract
Building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has recently become common practice. In the legal domain especially, previous judicial decisions play a significant role under the doctrine of stare decisis, which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of a RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems, focusing on the legal domain. LRAGE provides GUI and CLI interfaces that facilitate seamless experiments and make it possible to investigate how changes in these five components affect overall accuracy. We validated LRAGE on multilingual legal benchmarks in Korean (KBL), English (LegalBench), and Chinese (LawBench), demonstrating how overall accuracy changes as the five components are varied. The source code is available at https://github.com/hoorangyee/LRAGE.
