BillSum: A Corpus for Automatic Summarization of US Legislation

Anastassia Kornilova, Vlad Eidelman

TL;DR

BillSum introduces the first dedicated corpus for automatic summarization of US legislative text, comprising 22,218 US Congressional bills with reference summaries and an additional 1,237 California bills for testing transferability. The authors benchmark extractive approaches, including a feature-based random forest (DOC), a BERT-based classifier (SUM), and an ensemble that uses Maximal Marginal Relevance to produce 2,000-character summaries, and show improvements over traditional unsupervised baselines. Results show that the SUM model typically performs best on US bills and that methods trained on US data can transfer to California bills, albeit with reduced accuracy, which underscores both the domain-specific challenges and the potential for cross-jurisdictional use. The work also analyzes the language of the summaries via KL-divergence and finds that action-oriented terms are characteristic of summaries, which offers interpretability guidance for future model development and for expanding the corpus to more states.
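
To make the Maximal Marginal Relevance step mentioned above more concrete, here is a minimal sketch of greedy MMR selection under a character budget. It is an illustration only, not the authors' implementation: the function names `jaccard` and `mmr_select`, the token-overlap similarity, and the trade-off weight `lam` are all assumptions, and the relevance scores are taken as given (for example, from a sentence classifier).

```python
def jaccard(a, b):
    """Token-overlap similarity between two sentences (an assumed stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def mmr_select(sentences, scores, budget=2000, lam=0.7):
    """Greedily pick sentences by MMR, skipping any that would exceed the budget.

    sentences: list of sentence strings
    scores:    relevance score per sentence (assumed to come from some model)
    budget:    maximum total number of characters in the summary
    lam:       weight on relevance versus redundancy (hypothetical default)
    """
    selected = []                      # indices of chosen sentences
    remaining = list(range(len(sentences)))
    used = 0                           # characters used so far

    while remaining:
        def mmr(i):
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * scores[i] - (1 - lam) * redundancy

        best = max(remaining, key=mmr)
        remaining.remove(best)
        if used + len(sentences[best]) > budget:
            continue                   # does not fit; try the next candidate
        selected.append(best)
        used += len(sentences[best])

    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(selected)]
```

With a near-duplicate pair of high-scoring sentences, the redundancy penalty tends to favor a lower-scoring but novel sentence over the second copy, which is the behavior MMR is meant to provide.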

Abstract

Automatic summarization methods have been studied in a variety of domains, including news and scientific articles. Yet legislation has not previously been considered for this task, despite the US Congress and state governments releasing tens of thousands of bills every year. In this paper, we introduce BillSum, the first dataset for summarization of US Congressional and California state bills (https://github.com/FiscalNote/BillSum). We explain the properties of the dataset that make it more challenging to process than other domains. Then, we benchmark extractive methods that consider neural sentence representations and traditional contextual features. Finally, we demonstrate that models built on Congressional bills can be used to summarize California bills, thus showing that methods developed on this dataset can transfer to states without human-written summaries.

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Bill Lengths
  • Figure 2: Example US Bill
  • Figure 3: US H.R.1680 (115th)
  • Figure 4: US H.R.6355 (115th)
  • Figure 5: California Bill Summary