SEAL: Segment-wise Extractive-Abstractive Long-form Text Summarization

Yao Zhao, Mohammad Saleh, Peter J. Liu

TL;DR

SEAL addresses long-form neural abstractive summarization (LF-NAS) by introducing a segment-wise extractive-abstractive model that attends to a sparse set of input snippets for each output segment. The model is trained end-to-end, jointly optimizing the abstractive loss and a weakly supervised extractive loss derived from proxy labels, which enables dynamic content selection across very long inputs (up to 100,000 tokens). It achieves state-of-the-art results on arXiv and PubMed and outperforms strong baselines on a new long-input dataset, Search2Wiki, while improving interpretability through explicit snippet selection. The work unifies single-document summarization (SDS) and multi-document summarization (MDS), extends extractive-abstractive paradigms, and shows that longer input context coupled with targeted content selection yields substantial gains for LF-NAS.

Abstract

Most prior work in the sequence-to-sequence paradigm focused on datasets with input sequence lengths in the hundreds of tokens due to the computational constraints of common RNN and Transformer architectures. In this paper, we study long-form abstractive text summarization, a sequence-to-sequence setting with input sequence lengths up to 100,000 tokens and output sequence lengths up to 768 tokens. We propose SEAL, a Transformer-based model, featuring a new encoder-decoder attention that dynamically extracts/selects input snippets to sparsely attend to for each output segment. Using only the original documents and summaries, we derive proxy labels that provide weak supervision for extractive layers simultaneously with regular supervision from abstractive summaries. The SEAL model achieves state-of-the-art results on existing long-form summarization tasks, and outperforms strong baseline models on a new dataset/task we introduce, Search2Wiki, with much longer input text. Since content selection is explicit in the SEAL model, a desirable side effect is that the selection can be inspected for enhanced interpretability.

Paper Structure

This paper contains 21 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Model architectures. $E_\theta$, $D_\phi$, $S_\psi$, $S_\psi^*$ are the encoder, decoder, and scorers, which contain trainable parameters $\theta$, $\phi$, $\psi$. $G$ is the gating function that selects and concatenates the top-scored snippet representations up to a certain maximum length. $x_{i}$ are input snippet IDs (each $x_i$ is a sequence of IDs), and $x'_{i}$ are encoded/compressed representations of the input snippets. $y_\tau$, $y_{t<\tau}$, $y_{seg_j}$ are the current decode IDs, all previous decode IDs, and the previous decode IDs in segment $j$. In this figure there are 7 input snippets in total and the decoders always attend to up to 3 input representations; the SEAL model is decoding the third segment.
  • Figure 2: Losses and how gradients flow. The left side shows the Trunc and CA models; the right side shows the EA and SEAL models. $l_a$ and $l_e$ are the abstractive and extractive losses; red arrows are gradients.
  • Figure 3: Illustration of encoder self-attention and encoder-decoder attention maps for the four models considered. $x_{i}$ are input snippets (encoders' inputs), and $x'_{i}$ are encoded/compressed representations (encoders' outputs, decoders' inputs) that correspond to the input snippets. $ys_j$ are decode segments (decoder's outputs) representing parts of the long decode sequence. Encoder self-attentions from $x_{i}$ to $x'_{i}$ are colored in blue and encoder-decoder attentions from $x'_{i}$ to $ys_{j}$ are colored in red. Note that each square represents a sequence of tokens in an input snippet or a decode segment, not a single token.
  • Figure 4: On the arXiv dataset: (a) Trunc models trained with different maximum input lengths, $L_{input}$. (b) EA models trained with different maximum extractive lengths, $L_{ext}$. (c) Effect of segment length $L_{seg}$ and maximum extractive length $L_{ext}$ for the SEAL model.
  • Figure 5: Visualization of the SEAL model on a CNN/DailyMail example (best viewed in color). Segments of decodes are colored differently. The input snippets each segment attends to are colored accordingly, with the segment IDs inserted at the front. When multiple segments attend to the same input snippet, it is colored as the first segment.
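
The gating function $G$ described in the Figure 1 caption (select the top-scored snippet representations and concatenate them up to a maximum length) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `gate`, the greedy stop at the first snippet that does not fit, and keeping the selection in descending score order are all assumptions made for illustration. The scores stand in for the output of a scorer such as $S_\psi$.

```python
import numpy as np


def gate(snippet_reprs, scores, max_len):
    """Hypothetical sketch of a gating function G.

    snippet_reprs: list of arrays x'_i, each of shape (len_i, d)
    scores:        one relevance score per snippet (stand-in for a scorer's output)
    max_len:       maximum total number of positions the decoder may attend to

    Returns the concatenated representations and the selected snippet indices.
    """
    # Rank snippets from highest to lowest score (stable for ties).
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")

    chosen, used = [], 0
    for i in order:
        n = snippet_reprs[i].shape[0]
        # Assumption: stop at the first snippet that would exceed the budget.
        if used + n > max_len:
            break
        chosen.append(int(i))
        used += n

    if not chosen:
        d = snippet_reprs[0].shape[1]
        return np.zeros((0, d)), chosen

    # Assumption: concatenate in descending score order.
    selected = np.concatenate([snippet_reprs[i] for i in chosen], axis=0)
    return selected, chosen


# Mirrors the Figure 1 setting: 7 input snippets, at most 3 attended to.
reprs = [np.full((2, 4), float(i)) for i in range(7)]  # toy x'_i, 2 positions each
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05]          # made-up scores
selected, idx = gate(reprs, scores, max_len=6)
print(idx)             # [1, 3, 5]
print(selected.shape)  # (6, 4)
```

In a segment-wise model such as SEAL, a selection of this kind would be recomputed for each output segment, so that different segments can attend to different snippets; here the scores are fixed toy values.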