The Cambridge Law Corpus: A Dataset for Legal AI Research

Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek

TL;DR

The Cambridge Law Corpus (CLC) is a large-scale UK legal-text resource comprising 258,146 court decisions (1595–2020) with raw text, metadata, and expert-annotated case outcomes. The paper describes data provenance, XML-based storage, and an agile, versioned curation process, and provides token-level outcome annotations for 638 cases for evaluation tasks. It benchmarks case-outcome extraction using RoBERTa (in end-to-end and two-step pipelines) and zero-shot GPT models, and applies Latent Dirichlet Allocation topic modeling to reveal historical shifts in legal topics. Legal and ethical considerations are central: the paper addresses GDPR and Open Government Licence compliance, and access is restricted to researchers under governance measures intended to mitigate risks. Overall, the corpus supports legal-AI research, including pre-training, fine-tuning, and analysis, for both historical and contemporary legal study.

Abstract

We introduce the Cambridge Law Corpus (CLC), a dataset for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.

Paper Structure (25 sections, 9 figures, 6 tables)


Figures (9)

  • Figure 1: A simplified view of the UK court and tribunal structure (judiciary2023).
  • Figure 2: Proportion of words in documents belonging to the listed topics. A word can belong to more than one topic. Left: Aggregated to a one-year period spanning 1950-2020. Centre: Aggregated to a one-year period spanning 2000-2020. Right: Aggregated to a ten-year period spanning 1573-2020.
  • Figure 3: Number of cases per year.
  • Figure 4: Number of cases per year, 1900 forwards.
  • Figure 5: Number of cases per court.
  • ...and 4 more figures