A Divide-and-Conquer Approach to the Summarization of Long Documents

Alexios Gidiotis, Grigorios Tsoumakas

TL;DR

The paper tackles the high computational cost and noisy training targets of long-document summarization by introducing DANCER, a divide-and-conquer framework that learns per-section summaries and then aggregates them. By aligning summary sentences to document sections with ROUGE-L and training on many section-level source-target pairs, DANCER shortens both inputs and outputs and allows training to be parallelized. Experiments on arXiv and PubMed show that DANCER improves a range of models: PEGASUS paired with DANCER achieves results on par with state-of-the-art baselines, and Pointer-Generator variants also benefit significantly. The approach is simple, flexible, and scalable, offering a practical way to apply advanced summarization models to arbitrarily long documents across domains.

Abstract

We present a novel divide-and-conquer method for the neural summarization of long documents. Our method exploits the discourse structure of the document and uses sentence similarity to split the problem into an ensemble of smaller summarization problems. In particular, we break a long document and its summary into multiple source-target pairs, which are used for training a model that learns to summarize each part of the document separately. These partial summaries are then combined in order to produce a final complete summary. With this approach we can decompose the problem of long document summarization into smaller and simpler problems, reducing computational complexity and creating more training examples, which at the same time contain less noise in the target summaries compared to the standard approach. We demonstrate that this approach paired with different summarization models, including sequence-to-sequence RNNs and Transformers, can lead to improved summarization performance. Our best models achieve results that are on par with the state of the art on two publicly available datasets of academic articles.
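
The splitting step described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, an LCS-based ROUGE-L F1 score, and a rule that assigns each summary sentence to the section containing its single best-matching sentence. The function names (`lcs_length`, `rouge_l_f1`, `align_summary_to_sections`) and the choice of F1 over precision or recall are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based ROUGE-L F1 between two sentences (assumed variant)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def align_summary_to_sections(summary_sentences, sections):
    """Assign each summary sentence to its best-matching section.

    `sections` maps a section name to a list of its sentences. The result
    maps each section name to the summary sentences assigned to it, which
    together form that section's training target. Sections that match no
    summary sentence are omitted.
    """
    targets = defaultdict(list)
    for summ in summary_sentences:
        best_section, best_score = None, 0.0
        for name, sentences in sections.items():
            score = max((rouge_l_f1(summ, s) for s in sentences), default=0.0)
            if score > best_score:
                best_section, best_score = name, score
        if best_section is not None:
            targets[best_section].append(summ)
    return dict(targets)


# Hypothetical toy document, for illustration only.
toy_sections = {
    "introduction": ["we study long document summarization"],
    "methods": ["we split each document into sections"],
    "literature": ["prior work used recurrent networks"],
}
toy_summary = [
    "we study summarization of long documents",
    "each document is split into sections",
]
toy_pairs = align_summary_to_sections(toy_summary, toy_sections)
```

Each entry of `toy_pairs` pairs one section with its share of the summary, so a single long document yields several short source-target training examples.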

Paper Structure

This paper contains 24 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The distribution of summary sentences per section type after the section classification and alignment using DANCER. For the PubMed dataset the sentences are more evenly distributed among the introduction, methods, results and conclusion sections, while for the arXiv dataset the majority of sentences are assigned to the introduction and methods sections. In both datasets it can be clearly seen that the literature section is almost never matched with any summary sentences.
  • Figure 2: Architecture of the core Pointer-Generator model. For each decoder timestep the model has a probability to either generate words from a fixed vocabulary or copy words from the source text.
  • Figure 3: The percentage of N-grams that are copied directly from the source to the target summary for both datasets. The percentages are high for both datasets, but for the PubMed dataset we observe a higher percentage of copied 2-grams, 3-grams, and 4-grams. This implies that the abstracts of the articles are in fact highly extractive and, as a result, this dataset favors extractive approaches more.
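
The copied N-gram statistic in Figure 3 can be approximated with a short sketch. It assumes whitespace tokenization, lowercasing, and counting every N-gram occurrence in the summary (rather than unique N-grams); the paper's exact counting procedure is not given here, and the helper names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def copied_ngram_percentage(source, summary, n):
    """Percentage of summary n-grams that also occur in the source."""
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    copied = sum(1 for gram in summary_ngrams if gram in source_ngrams)
    return 100.0 * copied / len(summary_ngrams)


# Hypothetical toy pair, for illustration only.
toy_source = "the model copies words from the source text"
toy_target = "the model copies words from memory"
bigram_pct = copied_ngram_percentage(toy_source, toy_target, 2)
trigram_pct = copied_ngram_percentage(toy_source, toy_target, 3)
```

A higher value for larger N indicates that longer spans of the summary appear verbatim in the source, which is the sense in which a dataset is called extractive.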