Extractive Summarization as Text Matching

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang

TL;DR

This work reframes extractive summarization as a semantic text matching task, introducing MatchSum, a Siamese-BERT framework that compares a document with candidate extractive summaries in a shared embedding space. It argues that dataset characteristics create an inherent gap between sentence-level and summary-level extractors, motivating a shift to summary-level optimization and a margin-based learning objective. Through experiments on six diverse datasets, the approach delivers state-of-the-art results on CNN/DailyMail and robust improvements across short and long summaries, supported by a detailed analysis of pearl-summaries and dataset properties. The paper also contributes a candidate-pruning methodology to manage the combinatorial search and releases code and data to foster further exploration of matching-based summarization.
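
The candidate-pruning idea can be illustrated with a minimal sketch. It assumes that some sentence-level scorer has already assigned a salience score to each sentence. The function name `generate_candidates` and the parameters `keep` and `sizes` are hypothetical and chosen for illustration, not taken from the released code.

```python
from itertools import combinations
from math import comb


def generate_candidates(sentence_scores, keep=5, sizes=(2, 3)):
    """Prune to the `keep` highest-scoring sentences, then enumerate subsets.

    sentence_scores: one salience score per document sentence (hypothetical
    input; any sentence-level scorer could supply it).
    Returns a list of tuples of sentence indices, each in document order.
    """
    # Rank sentence indices by score, highest first, and keep the top ones.
    ranked = sorted(range(len(sentence_scores)),
                    key=lambda i: sentence_scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore document order

    candidates = []
    for size in sizes:
        candidates.extend(combinations(kept, size))
    return candidates


scores = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6, 0.05]
cands = generate_candidates(scores, keep=5, sizes=(2, 3))
print(len(cands))                                    # C(5,2) + C(5,3) = 20
print(comb(len(scores), 2) + comb(len(scores), 3))   # 84 without pruning
```

Pruning first keeps the number of candidates fixed by `keep` and `sizes`, whereas enumerating subsets of the full document grows combinatorially with the number of sentences.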

Abstract

This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems. Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries (extracted from the original text) are matched in a semantic space. Notably, this paradigm shift to a semantic matching framework is well-grounded in our comprehensive analysis of the inherent gap between sentence-level and summary-level extractors based on the properties of the dataset. Moreover, even when instantiating the framework with a simple matching model, we have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1). Experiments on the other five datasets also show the effectiveness of the matching framework. We believe the power of this matching-based summarization framework has not been fully exploited. To encourage more instantiations in the future, we have released our code, processed datasets, and generated summaries at https://github.com/maszhongming/MatchSum.
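
To make the matching formulation concrete, the following sketch scores each candidate summary by its similarity to the document and selects the closest one. It is only an illustration: the `embed` function is a toy bag-of-words stand-in for a learned encoder such as BERT, and all function names here are assumptions rather than the paper's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in encoder: lowercase bag-of-words counts (NOT BERT)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_candidate(document, candidates):
    """Return the candidate summary whose embedding is closest to the document."""
    doc_vec = embed(document)  # the same encoder is used for both sides
    return max(candidates, key=lambda c: cosine(doc_vec, embed(c)))


document = ("the storm closed roads across the region . "
            "officials said the storm damaged power lines . "
            "a local bakery won an award .")
candidates = [
    "the storm closed roads across the region . "
    "officials said the storm damaged power lines .",
    "a local bakery won an award .",
]
print(best_candidate(document, candidates))
```

The key design point is that a candidate is scored as a whole summary against the document, rather than sentence by sentence.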

Paper Structure

This paper contains 25 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: MatchSum framework. We match the contextual representations of the document with gold summary and candidate summaries (extracted from the document). Intuitively, better candidate summaries should be semantically closer to the document, while the gold summary should be the closest.
  • Figure 2: Distribution of $z(\%)$ on six datasets. Because the number of candidate summaries for each document is different (short text may have relatively few candidates), we use $z$ / number of candidate summaries as the X-axis. The Y-axis represents the proportion of the best-summaries with this rank in the test set.
  • Figure 3: $\Delta(\mathcal{D})$ for different datasets.
  • Figure 4: Dataset splitting experiment. We split test sets into five parts according to $z$ described in Section \ref{sec:ranking}. The X-axis from left to right indicates the subsets of the test set with the value of $z$ from small to large, and the Y-axis represents the ROUGE improvement of MatchSum over BertExt on this subset.
  • Figure 5: $\psi$ of different datasets. Reddit is excluded because it has too few samples in the test set.
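
The intuition in the Figure 1 caption (the gold summary should be closest to the document, and better candidates closer than worse ones) can be expressed with hinge-style margin losses. The sketch below is one plausible form of such an objective, not a formula quoted from this page: the function names, the margin values, and the assumption that candidates arrive sorted from best to worst by some external quality measure are all illustrative.

```python
def gold_margin_loss(score_gold, score_candidate, margin=0.0):
    """Hinge loss: zero once the gold summary beats the candidate by `margin`."""
    return max(0.0, score_candidate - score_gold + margin)


def ranking_margin_loss(candidate_scores, margin=0.01):
    """Pairwise hinge loss over candidates sorted from best to worst.

    candidate_scores[i] is the document-candidate similarity of the i-th
    candidate; index 0 is assumed to be the best candidate. The required
    gap grows with the rank distance (j - i).
    """
    total = 0.0
    n = len(candidate_scores)
    for i in range(n):
        for j in range(i + 1, n):
            gap = (j - i) * margin
            total += max(0.0, candidate_scores[j] - candidate_scores[i] + gap)
    return total


# Similarities that already respect the ranking incur no loss.
print(ranking_margin_loss([0.9, 0.8, 0.7]))
# Similarities in reversed order are penalized for every pair.
print(ranking_margin_loss([0.7, 0.8, 0.9]))
```

Scaling the required gap by the rank distance asks candidates that are far apart in quality to be further apart in similarity as well.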

Theorems & Definitions (2)

  • Definition 1
  • Definition 2