AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction
Magnus Sesodia, Alina Petrova, John Armour, Thomas Lukasiewicz, Oana-Maria Camburu, Puneet K. Dokania, Philip Torr, Christian Schroeder de Witt
TL;DR
AnnoCaseLaw is a richly annotated dataset of 471 U.S. civil negligence appellate decisions designed to advance explainable legal judgment prediction. The work defines three tasks (Judgment Prediction, Concept Identification, and Automated Case Annotation) and benchmarks strong LLMs on them, revealing persistent difficulty in legal reasoning, especially with precedent-based inferences and nuanced concept labeling. The dataset provides expert token-level annotations across five types and 36 binary concepts, enabling research on interpretable AI and more human-aligned reasoning in legal NLP. The findings suggest that targeted fine-tuning, few-shot prompting, and future self-explanation approaches are needed to achieve reliable performance, with promising directions in concept-based models and bias analysis for fairer legal AI systems.
Abstract
Legal systems worldwide continue to struggle with overwhelming caseloads, limited judicial resources, and the growing complexity of legal proceedings. Artificial intelligence (AI) offers a promising solution, with Legal Judgment Prediction (LJP), the practice of predicting a court's decision from the case facts, emerging as a key research area. However, existing datasets often formulate the LJP task unrealistically and so fail to reflect its true difficulty. They also lack the high-quality annotations essential for legal reasoning and explainability. To address these shortcomings, we introduce AnnoCaseLaw, a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Each case is enriched with comprehensive, expert-labeled annotations that highlight key components of judicial decision making, along with relevant legal concepts. Our dataset lays the groundwork for more human-aligned, explainable LJP models. We define three legally relevant tasks: (1) judgment prediction; (2) concept identification; and (3) automated case annotation, and establish a performance baseline using industry-leading large language models (LLMs). Our results demonstrate that LJP remains a formidable task, with the application of legal precedent proving particularly difficult. Code and data are available at https://github.com/anonymouspolar1/annocaselaw.
