Promises and pitfalls of artificial intelligence for legal applications

Sayash Kapoor, Peter Henderson, Arvind Narayanan

TL;DR

This paper analyzes the potential and limits of AI in legal applications across information processing, creativity/reasoning, and predictive tasks. It argues that the evidence does not support a legal AI revolution and emphasizes significant evaluation challenges, such as contamination, construct validity, and prompt sensitivity. The authors propose governance-oriented recommendations, including involving legal experts in evaluating AI (e.g., LegalBench), pursuing naturalistic and task-specific evaluations, and restricting deployment to narrow, high-observability settings with strong transparency. They stress that predictive AI in law demands higher standards, transparency, and contestability to prevent harmful societal impacts. Overall, the work advocates robust socio-technical assessments to ensure safe, evidence-based adoption of AI in legal contexts.

Abstract

Is AI set to redefine the legal profession? We argue that this claim is not supported by the current evidence. We dive into AI's increasingly prevalent roles in three types of legal tasks: information processing; tasks involving creativity, reasoning, or judgment; and predictions about the future. We find that the ease of evaluating legal applications varies greatly across legal tasks, based on the ease of identifying correct answers and the observability of information relevant to the task at hand. Tasks that would lead to the most significant changes to the legal profession are also the ones most prone to overoptimism about AI capabilities, as they are harder to evaluate. We make recommendations for better evaluation and deployment of AI in legal contexts.

Paper Structure

This paper contains 5 sections and 2 figures.

Figures (2)

  • Figure 1: Types of evaluations of generative AI. Current evaluations of AI are often based on exam benchmarks meant for humans, such as the bar exam, and suffer from contamination: overlaps between the training and evaluation datasets. Comparing the performance of these models on real-world tasks, especially those curated by legal experts, is more likely to be useful. Since the use of generative AI is nascent, qualitative studies that observe how legal experts use these tools for day-to-day tasks are likely to be a more useful, if expensive, way of evaluating these tools.
  • Figure 2: Variation in the difficulty of evaluating AI for legal tasks. We categorize difficulty along two dimensions: clarity on correct labels and observability of relevant features. Some tasks, such as AI for categorizing requests for legal help by area of law, have clear correct answers, whereas for other tasks, such as preparing legal filings using AI, there is no clear right answer, which makes evaluation hard. Similarly, for some applications, all relevant features are available, such as for spotting common errors in legal filings. For others, relevant features are not (or cannot be) available, such as for predictive AI. As we proceed from right to left, the clarity of correct answers and observability of relevant features roughly increase.
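
The Figure 1 caption describes contamination as overlap between the training and evaluation datasets. One common way to probe for it is to measure word n-gram overlap between a benchmark item and a reference corpus. The sketch below is illustrative only and is not the authors' method: the function names, the tokenization, and the choice of n are assumptions, and it presumes the training text is available for inspection, which is often not the case for proprietary models.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(eval_item: str, training_docs: list[str], n: int = 8) -> float:
    """Fraction of the eval item's n-grams that appear in any training document.

    Hypothetical helper: a high value hints that the item may have been
    seen during training; a low value does not prove it was unseen.
    """
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)


# Toy corpus standing in for training data (invented for illustration).
corpus = [
    "The court held that a contract requires offer, acceptance, "
    "and consideration to be enforceable."
]

seen = overlap_fraction(
    "a contract requires offer, acceptance, and consideration", corpus, n=3
)
unseen = overlap_fraction("which party bears the burden of proof", corpus, n=3)
```

Here `seen` is high because every trigram of the first item occurs in the toy corpus, while `unseen` is zero. Exact n-gram matching misses paraphrased or translated copies, so a low score is weak evidence at best, which is consistent with the caption's preference for expert-curated real-world tasks over exam benchmarks.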