LePaRD: A Large-Scale Dataset of Judges Citing Precedents

Robert Mahari, Dominik Stammbach, Elliott Ash, Alex 'Sandy' Pentland

TL;DR

LePaRD tackles the challenge of legal passage retrieval by introducing a large-scale dataset that pairs quotations from U.S. federal judicial opinions with their source passages and the context that precedes each quotation. The authors construct the dataset from the Caselaw Access Project (CAP), extracting millions of context-target pairs along with metadata, and they benchmark multiple retrieval approaches. They find that supervised classification (e.g., DistilBERT) yields the strongest results, while traditional lexical methods struggle because the context and the target passage share little vocabulary. The work provides baseline evaluations, analyzes dataset properties such as long-tail distributions and cross-court variability, and discusses practical implications for access to justice and potential uses in retrieval-augmented generation. By releasing LePaRD, the authors offer a resource for practice-oriented legal NLP that can support methodological advances and more efficient legal research tools.

Abstract

We present the Legal Passage Retrieval Dataset (LePaRD). LePaRD is a massive collection of U.S. federal judicial citations to precedent in context. The dataset aims to facilitate work on legal passage prediction, a challenging practice-oriented legal retrieval and reasoning task. Legal passage prediction seeks to predict relevant passages from precedential court decisions given the context of a legal argument. We extensively evaluate various retrieval approaches on LePaRD, and find that classification appears to work best. However, we note that legal precedent prediction is a difficult task, and there remains significant room for improvement. We hope that by publishing LePaRD, we will encourage others to engage with a legal NLP task that promises to help expand access to justice by reducing the burden associated with legal research. A subset of the LePaRD dataset is freely available and the whole dataset will be released upon publication.

Paper Structure (28 sections, 5 figures, 4 tables)

This paper contains 28 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: A simple example, taken from Diamond v. Chakrabarty, of how judges quote precedent. In LePaRD, the preceding context is extracted ahead of a quotation in the destination opinion (Diamond v. Chakrabarty). Quotations are matched to the corresponding target passage in the source opinion (Marbury v. Madison) using the citations contained in judicial opinions. The goal of legal passage retrieval is to predict the correct target passage given the preceding context.
  • Figure 2: Schematic of how LePaRD is constructed. First, we find all quotations across all 1.7 million published federal opinions in CAP and retain the text ahead of each quotation ("context") and the citations to other opinions. Second, we use the citations to other opinions to check whether each quotation can be matched to a passage from a prior case. If a match is found, a training example is constructed from the relevant preceding context and the associated target passage.
  • Figure 3: Comparing citations to judicial opinions from the same court ("self citation") to citations to other courts ("cross cite"). We find that appellate courts are most likely to cite themselves, while district courts only rarely cite their own precedent.
  • Figure 4: Distribution of time in units of log days between the first and last citation of a passage in our data.
  • Figure 5: Hierarchical clustering of passage co-occurrence.
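
The two-step construction described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' pipeline: the regular expression, the exact-substring matching rule, the names `extract_quotations`, `find_target`, and `build_examples`, and the parameters `CONTEXT_CHARS` and `MIN_QUOTE_CHARS` are all assumptions made for illustration, and the toy opinion text is invented.

```python
import re

# Hypothetical parameters: the source does not state a context window
# or a minimum quotation length.
CONTEXT_CHARS = 300
MIN_QUOTE_CHARS = 20

# Assumption: quotations are delimited by straight double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"')


def extract_quotations(opinion_text):
    """Step 1: yield (context, quotation) pairs from a destination opinion."""
    for match in QUOTE_RE.finditer(opinion_text):
        quote = match.group(1).strip()
        if len(quote) < MIN_QUOTE_CHARS:
            continue  # skip short quoted fragments
        start = max(0, match.start() - CONTEXT_CHARS)
        context = opinion_text[start:match.start()].strip()
        yield context, quote


def find_target(quote, cited_ids, passages_by_opinion):
    """Step 2: search only the opinions that the destination opinion cites."""
    for cited_id in cited_ids:
        for passage in passages_by_opinion.get(cited_id, []):
            if quote in passage:  # assumption: exact substring match
                return cited_id, passage
    return None


def build_examples(opinion_text, cited_ids, passages_by_opinion):
    """Combine both steps into (context, target passage) training examples."""
    examples = []
    for context, quote in extract_quotations(opinion_text):
        hit = find_target(quote, cited_ids, passages_by_opinion)
        if hit is not None:
            source_id, passage = hit
            examples.append({
                "context": context,
                "quotation": quote,
                "source_opinion": source_id,
                "target_passage": passage,
            })
    return examples


# Toy data, invented for illustration only.
passages = {
    "op_a": [
        "The court held that a toy rule applies to every toy case before it.",
        "An unrelated passage.",
    ]
}
text = ('As the earlier court explained, "a toy rule applies to every toy case" '
        'and so we follow it. See "x".')
examples = build_examples(text, ["op_a"], passages)
```

Restricting the search to cited opinions keeps the matching step small: each quotation is compared only against passages from the handful of opinions its destination opinion cites, not against the whole corpus.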