LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs

Serene Wang, Lavanya Pobbathi, Haihua Chen

TL;DR

Chain-of-thought prompting substantially improves LLM performance, while domain-specific models exhibit more stable zero-shot behavior. LAMUS provides a scalable resource and empirical insights for future legal NLP research.

Abstract

Legal argument mining aims to identify and classify the functional components of judicial reasoning, such as facts, issues, rules, analysis, and conclusions. Progress in this area is limited by the lack of large-scale, high-quality annotated datasets for U.S. caselaw, particularly at the state level. This paper introduces LAMUS, a sentence-level legal argument mining corpus constructed from U.S. Supreme Court decisions and Texas criminal appellate opinions. The dataset is created using a data-centric pipeline that combines large-scale case collection, LLM-based automatic annotation, and targeted human-in-the-loop quality refinement. We formulate legal argument mining as a six-class sentence classification task and evaluate multiple general-purpose and legal-domain language models under zero-shot, few-shot, and chain-of-thought prompting strategies, with LegalBERT as a supervised baseline. Results show that chain-of-thought prompting substantially improves LLM performance, while domain-specific models exhibit more stable zero-shot behavior. LLM-assisted verification corrects nearly 20% of annotation errors, improving label consistency. Human verification achieves Cohen's Kappa of 0.85, confirming annotation quality. LAMUS provides a scalable resource and empirical insights for future legal NLP research. All code and datasets can be accessed for reproducibility on GitHub at: https://github.com/LavanyaPobbathi/LAMUS/tree/main
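The agreement figure quoted above is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula, not the authors' evaluation code. The function name `cohens_kappa` and the toy label lists are hypothetical, and the labels are drawn only from the component names mentioned in the abstract.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sentences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # p_o: fraction of sentences on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: chance agreement, from each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined (0/0), so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations, for illustration only.
annotator_1 = ["Facts", "Issue", "Facts", "Analysis"]
annotator_2 = ["Facts", "Issue", "Analysis", "Analysis"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 3 of 4 sentences (p_o = 0.75) and chance agreement is 5/16, so kappa is 7/11, noticeably lower than the raw agreement. A kappa of 0.85 therefore indicates agreement well beyond what label frequencies alone would produce.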

Paper Structure (47 sections, 8 figures, 18 tables)

Figures (8)

  • Figure 1: Methodology for corpus construction, LLM-based sentence annotation, data quality assessment, and model evaluation in the LAMUS dataset.
  • Figure 2: Comparison of Texas Case Law dataset sentence labels before and after the data cleaning process, indicating improved quality.
  • Figure 3: Accuracy of LLaMA-3-8B and SaulLM-54B under varying few-shot example counts. Few-shot prompting harms performance for LLaMA-3-8B, while SaulLM-54B remains relatively stable.
  • Figure 4: Ablation Study: Learning Rate vs. Accuracy by LoRA Rank and Training Epochs. Results from 36 experiments across 4 learning rates (1e-5, 5e-5, 1e-4, 2e-4), 3 LoRA ranks (8, 16, 32), and 3 epoch settings (1, 3, 5). The best configuration (LR=1e-4, Rank=8, Epochs=5) achieves 85.32% accuracy.
  • Figure 5: Overall distribution of legal argument label categories across all Supreme Court eras (1921–present), comprising 2,900,083 sentences. The corpus is dominated by Analysis, Rule/Law/Holding, and Facts, while Issue and Conclusion occur less frequently but consistently.
  • ...and 3 more figures