The Cambridge Law Corpus: A Dataset for Legal AI Research

Andreas Östling; Holli Sargeant; Huiyuan Xie; Ludwig Bull; Alexander Terenin; Leif Jonsson; Måns Magnusson; Felix Steffek

TL;DR

The Cambridge Law Corpus (CLC) is a large-scale UK legal-text resource comprising 258,146 court decisions (1595–2020) with raw text, metadata, and expert-annotated case outcomes. The paper details data provenance, XML-based storage, and an agile, versioned curation process, and provides token-level outcome annotations for 638 cases to support evaluation tasks. It benchmarks case-outcome extraction using RoBERTa (in end-to-end and two-step pipelines) and zero-shot GPT models, and applies topic modeling with Latent Dirichlet Allocation to reveal historical shifts in legal topics. Legal and ethical considerations are central: the paper addresses GDPR and Open Government Licence compliance, and access is restricted to researchers under governance designed to mitigate risks. Overall, the corpus supports legal-AI research, including pre-training, fine-tuning, and analysis of both historical and contemporary law.

Abstract

We introduce the Cambridge Law Corpus (CLC), a dataset for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.

Paper Structure (25 sections, 9 figures, 6 tables)

This paper contains 25 sections, 9 figures, 6 tables.

Introduction
The Cambridge Law Corpus
The United Kingdom's Legal System
Corpus Content
Corpus Creation and Curation Process
Case Outcome Annotations
Legal and Ethical Considerations
Experiments
Case Outcome Extraction
Topic Model Analysis
Conclusion
Detailed Information on Corpus Content
Example XML case
Case Outcome Task Description
Ablation study on sentence classification step in the two-step RoBERTa pipeline
...and 10 more sections

Figures (9)

Figure 1: A simplified view of the UK court and tribunal structure [judiciary2023].
Figure 2: Proportion of words in documents belonging to the listed topics. A word can belong to more than one topic. Left: Aggregated to a one-year period spanning 1950-2020. Centre: Aggregated to a one-year period spanning 2000-2020. Right: Aggregated to a ten-year period spanning 1573-2020.
Figure 3: Number of cases per year.
Figure 4: Number of cases per year, from 1900 onwards.
Figure 5: Number of cases per court.
...and 4 more figures
