
Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys

Mohamed Afane, Emaan Hariri, Derek Ouyang, Daniel E. Ho

TL;DR

This paper assesses the Statutory Research Assistant (STARA), a custom statutory research tool, alongside two commercial tools from Westlaw and LexisNexis that market AI statutory survey capabilities. It shows that the commercial platforms fare poorly and charts a path forward for legal RAG through concrete design principles.

Abstract

Retrieval-augmented generation (RAG) offers significant potential for legal AI, yet systematic benchmarks are sparse. Prior work introduced LaborBench to benchmark RAG models based on ostensible ground truth from an exhaustive, multi-month, manual enumeration of all U.S. state unemployment insurance requirements by U.S. Department of Labor (DOL) attorneys. That prior work found poor performance of standard RAG (70% accuracy on Boolean tasks). Here, we assess three emerging tools not previously evaluated on LaborBench: the Statutory Research Assistant (STARA), a custom statutory research tool, and two commercial tools by Westlaw and LexisNexis marketing AI statutory survey capabilities. We make five main contributions. First, we show that STARA achieves substantial performance gains, boosting accuracy to 83%. Second, we show that commercial platforms fare poorly, with accuracy of 58% (Westlaw AI) and 64% (Lexis+ AI), even worse than standard RAG. Third, we conduct a comprehensive error analysis, comparing our outputs to those compiled by DOL attorneys, and document both reasoning errors, such as confusion between related legal concepts and misinterpretation of statutory exceptions, and retrieval failures, where relevant statutory provisions are not captured. Fourth, we discover that many apparent errors are actually significant omissions by DOL attorneys themselves, such that STARA's actual accuracy is 92%. Fifth, we chart the path forward for legal RAG through concrete design principles, offering actionable guidance for building AI systems capable of accurate multi-jurisdictional legal research.
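The accuracy figures above are reported on Boolean tasks, compared against a majority-class baseline, and then recomputed after correcting omissions in the reference labels. The following is a minimal sketch of that kind of scoring, not the authors' evaluation code: the function names (`score`, `majority_baseline`, `apply_corrections`) are hypothetical, and the labels and predictions are invented toy values chosen only to make the arithmetic easy to follow.

```python
def score(predictions, labels):
    """Return accuracy, precision, and recall for Boolean predictions."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, g in pairs if p and g)          # true positives
    fp = sum(1 for p, g in pairs if p and not g)      # false positives
    fn = sum(1 for p, g in pairs if not p and g)      # false negatives
    tn = sum(1 for p, g in pairs if not p and not g)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def majority_baseline(labels):
    """Predict the most common label for every item (ties go to True)."""
    majority = sum(labels) * 2 >= len(labels)
    return [majority] * len(labels)


def apply_corrections(labels, corrections):
    """Return a copy of labels with reviewed entries overwritten.

    `corrections` maps an item index to its corrected Boolean label,
    e.g. a provision the reference survey missed.
    """
    fixed = list(labels)
    for index, value in corrections.items():
        fixed[index] = value
    return fixed


# Toy data (assumed, not from the paper): 8 state/question pairs.
labels = [True, False, False, True, False, False, False, True]
predictions = [True, True, False, True, False, False, False, False]

original = score(predictions, labels)
baseline = score(majority_baseline(labels), labels)

# Suppose manual review finds that item 1 was a gap in the reference
# labels, so the apparent false positive was actually correct.
corrected_labels = apply_corrections(labels, {1: True})
corrected = score(predictions, corrected_labels)

print(original["accuracy"], baseline["accuracy"], corrected["accuracy"])
```

In this toy case, one apparent false positive becomes a true positive once the reference label is corrected. Accuracy and precision both rise without any change to the system's output, which is the same effect the abstract describes for STARA's corrected accuracy.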

Paper Structure (38 sections, 15 figures, 15 tables)


Figures (15)

  • Figure 1: Summary of our benchmarking process. DOL = United States Department of Labor; UI = Unemployment insurance; OCR = Optical character recognition; QA = question/answer; STARA = Statutory Research Assistant.
  • Figure 1: Performance comparison across AI systems. The baseline represents a majority class classifier. RAG represents the best performing retrieval-augmented generation model tested in prior work [haririAIStatutorySimplification2025]. STARA (Corrected) shows performance after accounting for provisions missed in the DOL compilation.
  • Figure 2: Distribution of false positives and false negatives across Lexis+ AI, Westlaw AI, and STARA.
  • Figure 3: Comparative performance on identifying states with self-employment assistance programs, including both active programs and authorizing legislation. STARA identified 14 states in total: 9 from the Department of Labor (DOL) compilation plus 5 additional states. Westlaw AI showed higher recall but numerous false positives. Lexis+ AI identified 8 states with high precision but low recall.
  • Figure 4: STARA false positives by error type. DOL Survey Gaps represent legitimate omissions from the expert compilation, Reasoning Errors indicate misclassification of legal provisions, and System Errors reflect technical mistakes in cross-state citation processing.
  • ...and 10 more figures