AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction

Magnus Sesodia, Alina Petrova, John Armour, Thomas Lukasiewicz, Oana-Maria Camburu, Puneet K. Dokania, Philip Torr, Christian Schroeder de Witt

TL;DR

AnnoCaseLaw introduces a richly annotated dataset of 471 U.S. civil negligence appellate decisions designed to advance explainable legal judgment prediction. By defining three tasks—Judgment Prediction, Concept Identification, and Automated Case Annotation—and benchmarking strong LLMs, the work reveals persistent difficulty in legal reasoning, especially with precedent-based inferences and nuanced concept labeling. The dataset provides expert token-level annotations across five types and 36 binary concepts, enabling interpretable AI research and more human-aligned reasoning in legal NLP. The findings suggest that targeted fine-tuning, few-shot prompting, and future self-explanation approaches are needed to achieve reliable performance, with promising directions in concept-based models and bias analysis for fairer legal AI systems.

Abstract

Legal systems worldwide continue to struggle with overwhelming caseloads, limited judicial resources, and growing complexities in legal proceedings. Artificial intelligence (AI) offers a promising solution, with Legal Judgment Prediction (LJP) -- the practice of predicting a court's decision from the case facts -- emerging as a key research area. However, existing datasets often formulate the task of LJP unrealistically, not reflecting its true difficulty. They also lack high-quality annotation essential for legal reasoning and explainability. To address these shortcomings, we introduce AnnoCaseLaw, a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Each case is enriched with comprehensive, expert-labeled annotations that highlight key components of judicial decision making, along with relevant legal concepts. Our dataset lays the groundwork for more human-aligned, explainable LJP models. We define three legally relevant tasks: (1) judgment prediction; (2) concept identification; and (3) automated case annotation, and establish a performance baseline using industry-leading large language models (LLMs). Our results demonstrate that LJP remains a formidable task, with application of legal precedent proving particularly difficult. Code and data are available at https://github.com/anonymouspolar1/annocaselaw.

Paper Structure

This paper contains 32 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: A (very short) example case (source: https://cite.case.law/ark/264/691/). The annotations are for Facts, Procedural History, Relevant Precedents, Application of Law to Facts, and Outcome. The legal concept labels are {'Duty of Care': 0, 'Breach of Duty': 1, 'Contributory Negligence': 0}.
  • Figure 2: Task #1: Judgment Prediction. Class-weighted F1 score for all three subtasks: (a) using Facts and Procedural History; (b) + Relevant Precedents; (c) + Application of Law to Facts. Error bars denote 95% confidence intervals. Average is the mean of the weighted-F1 scores across subtasks (a)--(c).
  • Figure 3: Task #2: Concept Identification. Each of the three x-axis macro-level concepts is predicted for every case in the dataset. Error bars denote 95% confidence intervals.
  • Figure 4: Task #3: Annotation. The model has to highlight the relevant parts of the full case text corresponding to the x-axis annotation types. Error bars denote standard error of the mean.
  • Figure 5: The instructions given to legal scholars on how to annotate the cases.
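
The Figure 2 caption reports a class-weighted F1 score and an average over subtasks (a)--(c). As a rough illustration of what that metric computes, the sketch below averages per-class F1 with weights proportional to each class's support. The function names and the "affirm"/"reverse" labels are hypothetical, chosen only for the example; the paper's actual label set and evaluation code may differ.

```python
from collections import Counter
from statistics import mean


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def average_over_subtasks(scores):
    """Unweighted mean of the per-subtask scores, as in the 'Average' bar."""
    return mean(scores)


# Hypothetical outcome labels, for illustration only.
y_true = ["affirm", "affirm", "affirm", "reverse"]
y_pred = ["affirm", "affirm", "reverse", "reverse"]
example_score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps a frequent outcome class from being drowned out by rare ones, which matters if the outcome distribution is imbalanced.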