OrderSum: Semantic Sentence Ordering for Extractive Summarization

Taewan Kwon, Sangyong Lee

TL;DR

OrderSum addresses the underexplored problem of sentence order in extractive summarization by embedding candidate summaries in a semantic space that encodes sentence order. It combines sentence extraction with a summary-level triplet ranking objective that integrates ROUGE signals, including $\text{ROUGE-L}_{full}$, and employs anchor candidate sampling to manage training cost. Empirically, OrderSum reaches a state-of-the-art ROUGE-L of 30.52 on CNN/DailyMail, 2.54 points above the previous best model, and performs strongly on XSum, WikiHow, and PubMed, while qualitative analyses confirm improved sentence order over prior methods. The work demonstrates the practical impact of optimizing at the summary level for both inclusion and ordering, and highlights avenues for future work with longer summaries and abstractive extensions.

Abstract

Recent extractive summarization follows two main approaches: the sentence-level framework, which selects the sentences to include in a summary individually, and the summary-level framework, which generates multiple candidate summaries and ranks them. Previous work in both frameworks has primarily focused on improving which sentences in a document should be included in the summary. However, the sentence order of extractive summaries, which is critical to summary quality, remains underexplored. In this paper, we introduce OrderSum, a novel extractive summarization model that semantically orders sentences within an extractive summary. OrderSum introduces a new representation method that incorporates the sentence order into the embedding of the extractive summary, and an objective function that trains the model to identify which extractive summary has a better sentence order in the semantic space. Extensive experimental results demonstrate that OrderSum obtains state-of-the-art performance in both sentence inclusion and sentence order for extractive summarization. In particular, OrderSum achieves a ROUGE-L score of 30.52 on CNN/DailyMail, outperforming the previous state-of-the-art model by a large margin of 2.54 points.
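
To make the idea of ranking candidate summaries in a semantic space concrete, the following is a minimal sketch of a generic margin-based ranking loss of the kind used in summary-level frameworks. It is not OrderSum's actual objective, which is not reproduced in this overview. The function names, the use of cosine similarity to a document embedding, and the margin value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranking_loss(document_vec, candidate_vecs, margin=0.01):
    """Hypothetical margin ranking loss over candidate summary embeddings.

    Assumption: candidate_vecs is sorted best-first by some quality score
    (for example a ROUGE-based score). For every pair (i, j) with i ranked
    above j, the loss is positive unless candidate i is more similar to the
    document than candidate j by at least (j - i) * margin.
    """
    sims = [cosine(document_vec, c) for c in candidate_vecs]
    loss = 0.0
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            loss += max(0.0, sims[j] - sims[i] + (j - i) * margin)
    return loss


# Toy 2-D embeddings: the first candidate points the same way as the document.
document = [1.0, 0.0]
well_ordered = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
badly_ordered = list(reversed(well_ordered))

print(ranking_loss(document, well_ordered))
print(ranking_loss(document, badly_ordered))
```

In this toy setup the loss vanishes when the embedding similarities already agree with the quality ranking and grows when they contradict it, which is the behavior any such ranking objective relies on.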

Paper Structure

This paper contains 23 sections, 13 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: The structure of OrderSum is based on the summary-level framework. The first step is to extract key sentences using the extractor. In the second step, OrderSum generates the set of candidate summaries with different sentence orders. In the next step, OrderSum obtains the candidate summary embeddings representing the sentence order of each summary. Finally, OrderSum learns sentence order in the semantic space using a new objective function.
  • Figure 2: Validation graphs for ROUGE-1, ROUGE-2, and ROUGE-L scores during the training of BARTSUM 1024, CoLo 1024, and OrderSum 1024 on CNN/DailyMail. Training is conducted for 12K steps, with validation performed every 1,000 steps.
  • Figure 3: Validation graphs for ROUGE-L scores during the training of BARTSUM, CoLo, and OrderSum on the three datasets. Training is conducted for 10K, 12K, and 7K steps on XSum, WikiHow, and PubMed, respectively, with validation performed every 1,000 steps.
  • Figure 4: ROUGE-1, ROUGE-2, $\text{ROUGE-L}_{norm}$, and $\text{ROUGE-L}_{full}$ scores for every pair of summaries that share the same sentences but have different sentence orders. The indices of the x-axis and y-axis in each plot indicate each of the six summaries below. If the score between two summaries is close to 1, the ROUGE score hardly detects the difference in sentence order and is not suitable for evaluating sentence order.
  • Figure 5: Validation graphs for ROUGE-1, ROUGE-2, and ROUGE-L scores during the training of BARTSUM, CoLo, and OrderSum on CNN/DailyMail, XSum, WikiHow, and PubMed. On CNN/DailyMail, BARTSUM 1024, CoLo 1024, and OrderSum 1024 are used to obtain the graphs. BARTSUM, CoLo, and OrderSum are used on the remaining three datasets. Training is conducted for 12K, 10K, 12K, and 7K steps on CNN/DailyMail, XSum, WikiHow, and PubMed, respectively, with validation performed every 1,000 steps.
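
The point made in the Figure 4 caption, that some ROUGE variants hardly react to sentence order, can be illustrated with a small sketch. It compares every permutation of three toy sentences against the original order using a unigram-overlap F1 (in the spirit of ROUGE-1) and a longest-common-subsequence F1 computed over the whole concatenated summary. The exact definitions of $\text{ROUGE-L}_{norm}$ and $\text{ROUGE-L}_{full}$ are not given in this overview, so the LCS score below is an assumed stand-in rather than the paper's metric, and the sentences and function names are invented for the example.

```python
from collections import Counter
from itertools import permutations


def rouge1_f(candidate, reference):
    """Unigram-overlap F1 between two token lists (order-insensitive)."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_f(candidate, reference):
    """LCS-based F1 over the full token sequence (order-sensitive)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Invented toy sentences; a real summary would come from the extractor.
sentences = [
    "the storm hit the coast".split(),
    "residents were evacuated overnight".split(),
    "power returned two days later".split(),
]


def flatten(order):
    """Concatenate the sentences in the given order into one token list."""
    return [tok for i in order for tok in sentences[i]]


orders = list(permutations(range(len(sentences))))
reference = flatten(orders[0])
for order in orders:
    cand = flatten(order)
    print(order, round(rouge1_f(cand, reference), 3), round(lcs_f(cand, reference), 3))
```

Because every permutation contains exactly the same tokens, the unigram score cannot tell the orderings apart, whereas the full-sequence LCS score drops as soon as the sentences are rearranged.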