Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys

Mohamed Afane; Emaan Hariri; Derek Ouyang; Daniel E. Ho

TL;DR

We assess the Statutory Research Assistant (STARA), a custom statutory research tool, and two commercial tools from Westlaw and LexisNexis that market AI statutory survey capabilities. We show that the commercial platforms fare poorly, and we chart the path forward for legal RAG through concrete design principles.

Abstract

Retrieval-augmented generation (RAG) offers significant potential for legal AI, yet systematic benchmarks are sparse. Prior work introduced LaborBench to benchmark RAG models based on ostensible ground truth from an exhaustive, multi-month, manual enumeration of all U.S. state unemployment insurance requirements by U.S. Department of Labor (DOL) attorneys. That prior work found poor performance of standard RAG (70% accuracy on Boolean tasks). Here, we assess three emerging tools not previously evaluated on LaborBench: the Statutory Research Assistant (STARA), a custom statutory research tool, and two commercial tools by Westlaw and LexisNexis marketing AI statutory survey capabilities. We make five main contributions. First, we show that STARA achieves substantial performance gains, boosting accuracy to 83%. Second, we show that commercial platforms fare poorly, with accuracy of 58% (Westlaw AI) and 64% (Lexis+ AI), even worse than standard RAG. Third, we conduct a comprehensive error analysis, comparing our outputs to those compiled by DOL attorneys, and document both reasoning errors, such as confusion between related legal concepts and misinterpretation of statutory exceptions, and retrieval failures, where relevant statutory provisions are not captured. Fourth, we discover that many apparent errors are actually significant omissions by DOL attorneys themselves, such that STARA's actual accuracy is 92%. Fifth, we chart the path forward for legal RAG through concrete design principles, offering actionable guidance for building AI systems capable of accurate multi-jurisdictional legal research.
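
As a rough illustration of the scoring logic described above, the sketch below computes Boolean-task accuracy against a reference compilation, a majority-class baseline, and a re-scored accuracy after the reference itself is corrected. The function names and the five-item toy labels are hypothetical and are not drawn from LaborBench or the DOL compilation; the paper's actual evaluation pipeline may differ.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of Boolean answers that match the reference labels."""
    if len(predictions) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference)


def majority_baseline(reference: Sequence[bool]) -> float:
    """Accuracy of always answering with the most common reference label."""
    if not reference:
        raise ValueError("reference must be non-empty")
    positives = sum(reference)
    return max(positives, len(reference) - positives) / len(reference)


# Hypothetical toy data: one (state, question) pair per position.
system_answers = [True, True, False, True, False]
original_reference = [True, False, False, False, False]
# Assumption for illustration: manual review finds the second item was
# omitted from the reference, so that label flips to True.
corrected_reference = [True, True, False, False, False]

raw_accuracy = accuracy(system_answers, original_reference)
corrected_accuracy = accuracy(system_answers, corrected_reference)
baseline_accuracy = majority_baseline(original_reference)
```

In this toy case, an answer first counted as a false positive becomes a true positive once the reference is corrected, which is the same mechanism by which a system's measured accuracy can rise after gaps in an expert compilation are accounted for.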

Paper Structure (38 sections, 15 figures, 15 tables)

Introduction
Background and Related Works
Multi-Jurisdictional Statutory Analysis
Unemployment Insurance and LaborBench
STARA and Domain-Specific Retrieval
Commercial Jurisdictional Survey Tools
Methodology
Experimental Setup
Commercial Platform Evaluation
Validation of System Outputs and DOL Report Accuracy
Results
Overall Performance Comparison
Comparative System Output Analysis
Self-Employment Assistance
SNAP Overissuance
...and 23 more sections

Figures (15)

Figure 1: Summary of our benchmarking process. DOL = United States Department of Labor; UI = Unemployment insurance; OCR = Optical character recognition; QA = question/answer; STARA = Statutory Research Assistant.
Figure 1: Performance comparison across AI systems. The baseline represents a majority-class classifier. RAG represents the best-performing retrieval-augmented generation model tested in prior work [haririAIStatutorySimplification2025]. STARA (Corrected) shows performance after accounting for provisions missed in the DOL compilation.
Figure 2: Distribution of false positives and false negatives across Lexis+ AI, Westlaw AI, and STARA.
Figure 3: Comparative performance on identifying states with self-employment assistance programs, including both active programs and authorizing legislation. STARA identified 14 states in total: 9 from the Department of Labor (DOL) compilation plus 5 additional states. Westlaw AI showed higher recall but numerous false positives. Lexis+ AI identified 8 states, with high precision but low recall.
Figure 4: STARA false positives by error type. DOL Survey Gaps represent legitimate omissions from the expert compilation, Reasoning Errors indicate misclassification of legal provisions, and System Errors reflect technical mistakes in cross-state citation processing.
...and 10 more figures
