Summarizing long regulatory documents with a multi-step pipeline

Mika Sie; Ruby Beek; Michiel Bots; Sjaak Brinkkemper; Albert Gatt

TL;DR

This work tackles the challenge of summarizing long regulatory documents by proposing a two-step extractive–abstractive pipeline that uses document chunking to work around context-length limits. It systematically compares a range of extractive and abstractive architectures, including general-purpose RoBERTa, Longformer, LexLM, and decoder-only Llama3, under fixed, dependent, and hybrid compression strategies on EUR-Lex-Sum, evaluated with automated metrics and expert human judgments. The key findings are that a two-step approach benefits decoder-only models but can hinder long-context encoder–decoder models. Shorter extractive contexts and RoBERTa-based extractors with dependent compression often yield strong automated scores, whereas human evaluators preferred LexLM-based extractors and long-context setups. The results emphasize matching the summarization strategy to model architecture and context length, and they reveal gaps between automatic metrics and expert judgments in regulatory-text summarization. The study provides practical guidance for scalable regulatory document summarization and outlines directions for broader generalizability and more robust human-centered evaluation.

Abstract

Long regulatory texts are challenging to summarize because of their length and complexity. To address this, we propose a multi-step extractive-abstractive architecture that handles lengthy regulatory documents more effectively. In this paper, we show that the effectiveness of a two-step architecture for summarizing long regulatory texts varies significantly depending on the model used. Specifically, the two-step architecture improves the performance of decoder-only models. For abstractive encoder-decoder models with short context lengths, the effectiveness of an extractive step varies, whereas for long-context encoder-decoder models, the extractive step worsens performance. This research also highlights the challenges of evaluating generated texts, as evidenced by the differing results from human and automated evaluations. Most notably, human evaluations favored language models pretrained on legal text, while automated metrics ranked general-purpose language models higher. The results underscore the importance of selecting the appropriate summarization strategy based on model architecture and context length.
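To make the two-step idea concrete, the following is a minimal Python sketch of a chunk, extract, then abstract flow. It is an illustration under stated assumptions, not the authors' implementation: the function names, the word-frequency sentence scorer, the `max_words` chunk size, and the constant `ratio` are all hypothetical. The paper's extractive step uses trained models such as RoBERTa or LexLM, and the dependent and hybrid compression strategies are not reproduced here because this page does not define them.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences, max_words):
    """Greedily pack sentences into chunks of at most max_words words.

    A single sentence longer than max_words still gets its own chunk.
    """
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(current)
    return chunks


def extract(chunk, ratio, doc_freq):
    """Keep roughly `ratio` of a chunk's sentences, in their original order.

    ASSUMPTION: average document-level word frequency stands in for a
    trained extractive model; it is only a placeholder scorer.
    """
    def score(sent):
        words = re.findall(r"\w+", sent.lower())
        return sum(doc_freq[w] for w in words) / max(len(words), 1)

    keep = max(1, round(len(chunk) * ratio))
    ranked = sorted(range(len(chunk)), key=lambda i: score(chunk[i]), reverse=True)
    return [chunk[i] for i in sorted(ranked[:keep])]


def two_step_summary(text, abstractive_fn, max_words=50, ratio=0.5):
    """Step 1: chunk and extract. Step 2: pass the extract to abstractive_fn.

    `abstractive_fn` is any callable str -> str (hypothetical hook for an
    abstractive model).
    """
    sentences = split_sentences(text)
    doc_freq = Counter(re.findall(r"\w+", text.lower()))
    selected = []
    for chunk in chunk_sentences(sentences, max_words):
        selected.extend(extract(chunk, ratio, doc_freq))
    return abstractive_fn(" ".join(selected))
```

In use, `abstractive_fn` would wrap whichever abstractive model is being evaluated; passing `lambda s: s` returns the intermediate extract, which is convenient for inspecting what the first step discards. Chunking before extraction keeps each scoring call within a bounded input size, which is the property the pipeline relies on to sidestep context-length limits.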
