LRAGE: Legal Retrieval Augmented Generation Evaluation Tool

Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang

TL;DR

The paper addresses the challenge of evaluating retrieval-augmented generation (RAG) in the legal domain. It introduces LRAGE, an open-source platform that extends the Language Model Evaluation Harness with Pyserini-based retrievers and a rubric-based LLM-as-a-Judge, enabling holistic cross-component analysis. It ships with pre-configured legal benchmarks (LegalBench, KBL, LawBench) and legal corpora (Pile-of-Law), all accessible via GUI or CLI for immediate experimentation. Experiments across multilingual benchmarks show that overall RAG performance depends on the corpus, retriever, reranker, LLM backbone, and rubric design, underscoring the need for domain-specific evaluation tooling and its practical value for deploying legal AI more safely.

Abstract

Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become common practice. In the legal domain especially, previous judicial decisions play a significant role under the doctrine of stare decisis, which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of a RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and to investigate how changes in these five components affect the overall accuracy. We validated LRAGE on multilingual legal benchmarks, including Korean (KBL), English (LegalBench), and Chinese (LawBench), by demonstrating how the overall accuracy changes as each of the five components is varied. The source code is available at https://github.com/hoorangyee/LRAGE.
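
To make the five-component design space concrete, the following is a minimal sketch of how one might enumerate experiment configurations, either as a full grid or by varying a single component around a fixed baseline. It is not LRAGE's API: the names `RagConfig`, `config_grid`, and `one_factor_variants`, and all component values passed to them, are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, replace
from itertools import product
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class RagConfig:
    """One point in the five-component design space (hypothetical type)."""
    corpus: str
    retriever: str
    reranker: str
    llm: str
    metric: str


def config_grid(
    corpora: Sequence[str],
    retrievers: Sequence[str],
    rerankers: Sequence[str],
    llms: Sequence[str],
    metrics: Sequence[str],
) -> Iterator[RagConfig]:
    """Yield every combination of the five components (full factorial grid)."""
    for combo in product(corpora, retrievers, rerankers, llms, metrics):
        yield RagConfig(*combo)


def one_factor_variants(
    base: RagConfig, component: str, options: Sequence[str]
) -> List[RagConfig]:
    """Vary a single component while holding the other four at the baseline."""
    if component not in RagConfig.__dataclass_fields__:
        raise ValueError(f"unknown component: {component}")
    return [replace(base, **{component: value}) for value in options]


if __name__ == "__main__":
    # All values below are illustrative placeholders, not settings from the paper.
    baseline = RagConfig(
        corpus="pile-of-law",
        retriever="bm25",
        reranker="none",
        llm="placeholder-llm",
        metric="accuracy",
    )
    for cfg in one_factor_variants(baseline, "retriever", ["bm25", "dense"]):
        print(cfg)
```

Varying one component at a time keeps the number of runs linear in the number of options, whereas the full grid grows multiplicatively; which is appropriate depends on the compute budget and on whether interactions between components are of interest.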

Paper Structure

This paper contains 28 sections, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Comparison of the conventional RAG evaluation pipeline (bottom) and the LRAGE framework (top), in which each process is seamlessly integrated.
  • Figure 2: System diagram
  • Figure 3: GUI of LRAGE. It consists of six tabs: Task (top-left), Model (top-center), Generation Parameters (top-right), Retriever (bottom-left), LLM-as-a-Judge (bottom-center), and a result tab (bottom-right). Each configuration tab lets users define settings, which the result tab then uses to run experiments and display the results immediately.