LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English

T. Y. S. S. Santosh, Cornelius Weiss, Matthias Grabmair

TL;DR

LexSumm addresses a key gap in legal NLP by offering a benchmark for English legal summarization built from eight datasets spanning several jurisdictions, and the work also introduces LexT5, a seq2seq model pre-trained on LeXFiles for legal knowledge. The work demonstrates that long-context, retrieval-augmented frameworks and domain-specific pre-training yield strong performance, often surpassing baselines and zero-shot LLMs on LexSumm tasks. LexT5 is assessed through zero-shot probing on LegalLAMA and fine-tuning on LexSumm. An analysis of generated summaries reveals abstraction and faithfulness errors even for zero-shot large language models, underscoring the need for targeted pre-training and faithful decoding strategies. Overall, LexSumm and LexT5 provide a foundation for legal generation research and are released to the community to foster further development in legal NLP.

Abstract

In the evolving NLP landscape, benchmarks serve as yardsticks for gauging progress. However, existing legal NLP benchmarks focus only on predictive tasks, overlooking generative tasks. This work curates LexSumm, a benchmark designed for evaluating legal summarization tasks in English. It comprises eight English legal summarization datasets from diverse jurisdictions, such as the US, UK, EU, and India. Additionally, we release LexT5, a legal-oriented sequence-to-sequence model, addressing the limitation of the existing BERT-style encoder-only models in the legal domain. We assess its capabilities through zero-shot probing on LegalLAMA and fine-tuning on LexSumm. Our analysis reveals abstraction and faithfulness errors even in summaries generated by zero-shot LLMs, indicating opportunities for further improvements. The LexSumm benchmark and LexT5 model are available at https://github.com/TUMLegalTech/LexSumm-LexT5.

Paper Structure

This paper contains 32 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Distribution of input length, summary length and fusion scores for LexSumm datasets.
  • Figure 2: Distribution of input length, summary length and fusion scores for LexSumm datasets.
  • Figure 3: Visualization of the hierarchical merging strategy for summarizing long inputs: "S" represents the summary, "C" denotes the chunk, "n" is the total number of chunks, and "k" indicates the maximum number of levels.
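
The Figure 3 caption names a hierarchical merging strategy in which a long input is split into n chunks (C), each chunk is summarized (S), and the summaries are merged over at most k levels. The listing gives no further detail, so the following is only a minimal sketch of that general idea, not the authors' implementation. The function names, the whitespace tokenization, the fixed chunk size, and the toy `first_words` summarizer (a stand-in for a real model such as a fine-tuned seq2seq summarizer) are all assumptions made for illustration.

```python
from typing import Callable, List


def chunk_tokens(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def hierarchical_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int,
    max_levels: int,
) -> str:
    """Summarize chunks, concatenate the summaries, and repeat.

    Stops merging once the text fits in a single chunk or after
    max_levels rounds (the "k" of the caption), then runs one final
    summarization pass. Whitespace tokenization is an assumption.
    """
    tokens = text.split()
    for _ in range(max_levels):
        if len(tokens) <= chunk_size:
            break  # the remaining text fits in one chunk
        chunks = chunk_tokens(tokens, chunk_size)            # C_1 ... C_n
        summaries = [summarize(" ".join(c)) for c in chunks]  # S_1 ... S_n
        tokens = " ".join(summaries).split()                  # merged input for the next level
    # If max_levels was reached while the text is still too long, a real
    # summarizer would have to truncate here; this sketch passes it through.
    return summarize(" ".join(tokens))


def first_words(text: str, n_words: int = 2) -> str:
    """Hypothetical toy summarizer: keep the first n_words words."""
    return " ".join(text.split()[:n_words])


if __name__ == "__main__":
    document = " ".join(f"w{i}" for i in range(40))
    print(hierarchical_summarize(document, first_words, chunk_size=10, max_levels=3))
```

With the toy summarizer, a 40-word input and a chunk size of 10 yield four chunk summaries at the first level; their concatenation already fits in one chunk, so a single final pass produces the output. Swapping `first_words` for a call to an actual summarization model is the only change needed to try the same control flow on real documents.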