Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho

TL;DR

The paper tackles the risk of hallucinations in AI-based legal research by conducting a preregistered, empirical evaluation of leading tools (Lexis+ AI, Westlaw AI-AR, Practical Law AI) against GPT-4. It introduces a formal framework distinguishing correctness and groundedness to classify hallucinations and builds a preregistered benchmark of 200+ queries to probe real-world performance. The findings show that, while RAG reduces hallucinations relative to general-purpose models, all examined tools still hallucinate at substantial rates (e.g., 17–33%), with notable variability across systems and task types. The work offers a detailed typology of failure modes, inter-rater reliability metrics, and practical implications for lawyers and AI vendors, underscoring the need for transparent benchmarking and careful supervision in the responsible integration of AI into legal practice.

Abstract

Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI). Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. But the large language models used in these tools are prone to "hallucinate," or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as "eliminating" (Casetext, 2023) or "avoid[ing]" hallucinations (Thomson Reuters, 2023), or guaranteeing "hallucination-free" legal citations (LexisNexis, 2023). Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers' claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. First, it is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.

Paper Structure (61 sections, 5 figures, 6 tables)

This paper contains 61 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of hallucinated and incomplete answers across generative legal research tools. Hallucinated responses are those that include false statements or falsely assert a source supports a statement. Incomplete responses are those that fail to either address the user's query or provide proper citations for factual claims.
  • Figure 2: Top left: Example of a hallucinated response by Westlaw's AI-Assisted Research product. The system makes up a statement in the Federal Rules of Bankruptcy Procedure that does not exist. Top right: Example of a hallucinated response by LexisNexis's Lexis+ AI. Casey and its undue burden standard were overruled by the Supreme Court in Dobbs v. Jackson Women's Health Organization, 597 U.S. 215 (2022); the correct answer is rational basis review. Bottom left: Example of a hallucinated response by Thomson Reuters's Ask Practical Law AI. The system fails to correct the user's mistaken premise (in reality, Justice Ginsburg joined the Court's landmark decision legalizing same-sex marriage) and instead provides additional false information about the case. Bottom right: Example of a hallucinated response from GPT-4, which generates a statutory provision that does not exist.
  • Figure 3: Schematic diagram of a retrieval-augmented generation (RAG) system. Given a user query (left), the typical process consists of two steps: (1) retrieval (middle), where the query is embedded with natural language processing and a retrieval system takes embeddings and retrieves the relevant documents (e.g., Supreme Court cases); and (2) generation (right), where the retrieved texts are fed to the language model to generate the response to the user query. Any of the subsidiary steps may introduce error and hallucinations into the generated response. (Icons are credited to FlatIcon.)
  • Figure 4: Left panel: overall percentages of accurate, incomplete, and hallucinated responses. Right panel: the percentage of answers that are hallucinated when a direct response is given. Westlaw AI-AR and Ask Practical Law AI respond to fewer queries than GPT-4, but the responses that they do produce are not significantly more trustworthy. Vertical bars denote 95% confidence intervals.
  • Figure 5: Response evaluations broken down by question category. We show the accuracy (green), incompleteness (yellow), and hallucination (red) rate for each question category. Vertical bars denote 95% confidence intervals. This figure shows that hallucinations are not driven by an isolated category and persist across task types and questions, such as bar exam and appellate litigation issues.
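
The two-step process described in the Figure 3 caption (retrieval, then generation) can be illustrated with a minimal sketch. This is not how any of the evaluated products work internally, since those systems are closed. Everything here is an assumption made for illustration: the three-document `CORPUS` of paraphrased one-line summaries, the bag-of-words `embed` function standing in for a learned embedding model, and the `llm` callable standing in for a real language model. The names `embed`, `cosine`, `retrieve`, `build_prompt`, and `answer` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (paraphrased one-line summaries, for illustration only).
# A real legal research tool would index a large proprietary database.
CORPUS = {
    "dobbs": "Dobbs v. Jackson Women's Health Organization overruled Casey "
             "and its undue burden standard for abortion regulation.",
    "obergefell": "Obergefell v. Hodges held that same-sex couples have a "
                  "constitutional right to marriage.",
    "bankruptcy": "The Federal Rules of Bankruptcy Procedure govern "
                  "procedure in bankruptcy cases.",
}


def embed(text):
    """Toy embedding: word counts. Assumption: stands in for a learned model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, corpus, k=1):
    """Step 1 (retrieval): return the ids of the k most similar documents."""
    query_vec = embed(query)
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine(query_vec, embed(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, doc_ids, corpus):
    """Concatenate the retrieved texts with the user query."""
    sources = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in doc_ids)
    return f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer using only the sources."


def answer(query, corpus, llm, k=1):
    """Step 2 (generation): feed the retrieved texts to the language model."""
    doc_ids = retrieve(query, corpus, k)
    return llm(build_prompt(query, doc_ids, corpus))
```

A call such as `answer("Is the undue burden standard from Casey still good law?", CORPUS, llm=my_model)` would pass the retrieved text to whatever callable is supplied as `llm`. As the caption notes, either step can introduce error: retrieval may surface an irrelevant or outdated document, and the model may misstate a source even when the right one is retrieved.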