PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

Wen Xiao, Iz Beltagy, Giuseppe Carenini, Arman Cohan

TL;DR

PRIMERA introduces a pyramid-based masking pretraining approach for multi-document summarization that leverages a simple, concatenation-based input structure and Longformer-Encoder-Decoder to efficiently process document clusters. The core novelty, Entity Pyramid Masking, selects cross-document salient sentences via entity-frequency across a cluster to train a Gap Sentence Generation objective, enabling strong zero-/few-/full-supervised performance. Across six datasets from three domains, PRIMERA consistently surpasses state-of-the-art pretrained and dataset-specific models, with notable gains in low-resource settings and favorable human-evaluation results. The work highlights the value of cross-document saliency-aware pretraining and presents a practical, scalable path for multi-document summarization without heavy dataset-specific architectures.
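
The selection step behind Entity Pyramid Masking can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes entities have already been extracted for each sentence, and it uses a simple unigram-overlap F1 as a stand-in for the ROUGE-based score computed against the other documents in the cluster. The names `unigram_f1`, `select_sentences`, and `max_selected` are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1; a simplified stand-in for a ROUGE score (assumption)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def select_sentences(cluster, max_selected):
    """Pick one sentence per cross-document entity, most frequent entity first.

    cluster: list of documents; each document is a list of
             (sentence_text, set_of_entities) pairs (entities pre-extracted).
    Returns a list of (doc_index, sentence_index) pairs to mask.
    """
    # Count in how many documents each entity appears.
    doc_freq = Counter()
    for doc in cluster:
        entities_in_doc = set()
        for _, entities in doc:
            entities_in_doc |= set(entities)
        doc_freq.update(entities_in_doc)

    selected = []
    # Walk the "pyramid" from the most to the least frequent entity.
    for entity, freq in sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq <= 1 or len(selected) >= max_selected:
            break
        best, best_score = None, -1.0
        for d, doc in enumerate(cluster):
            # Text of every other document in the cluster.
            others = [" ".join(t for t, _ in other)
                      for j, other in enumerate(cluster) if j != d]
            for s, (text, entities) in enumerate(doc):
                if entity not in entities or (d, s) in selected:
                    continue
                # Average overlap of this sentence with the other documents.
                score = sum(unigram_f1(text, o) for o in others) / len(others)
                if score > best_score:
                    best, best_score = (d, s), score
        if best is not None:
            selected.append(best)
    return selected


# Toy cluster: 'Colorado' appears in all three documents, 'Denver' in one.
cluster = [
    [("Wildfire spreads in Colorado forcing evacuations", {"Colorado"}),
     ("Local shops closed early", set())],
    [("Colorado wildfire forces evacuations of hundreds", {"Colorado"}),
     ("Weather was dry", set())],
    [("Officials in Colorado said the wildfire forces evacuations", {"Colorado"}),
     ("Denver sent crews", {"Denver"})],
]
result = select_sentences(cluster, max_selected=2)
```

In this toy cluster only 'Colorado' occurs in more than one document, so a single sentence mentioning it is selected; entities confined to one document, such as 'Denver', are skipped. In pre-training, the selected sentences would be masked in the input and used as the generation target.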

Abstract

We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 multi-document summarization datasets from 3 different domains on zero-shot, few-shot and full-supervised settings, PRIMERA outperforms current state-of-the-art dataset-specific and pre-trained models on most of these settings with large margins. The code and pre-trained models can be found at https://github.com/allenai/PRIMER.

Paper Structure

This paper contains 41 sections, 1 equation, 8 figures, 14 tables, 1 algorithm.

Figures (8)

  • Figure 1: PRIMERA vs. existing pretrained models.
  • Figure 2: Model structure of PRIMERA.
  • Figure 3: An example of sentence selection by the Principle strategy vs. our Entity Pyramid strategy. Italic text in red is the sentence with the highest Principle ROUGE score, which is therefore chosen by the Principle strategy. The most frequent entity, 'Colorado', is shown in blue, followed by the Pyramid ROUGE scores in parentheses. The sentence finally selected by the Entity Pyramid strategy is in italics; it is a better pseudo-summary than the one selected by the Principle strategy.
  • Figure 4: The Entity Pyramid strategy for selecting salient sentences for masking. The entity pyramid is built from the frequency of entities across the documents. The most representative sentences are chosen based on Cluster ROUGE for each entity with frequency $>1$, e.g., Sentence 10 in Document 2 for Entity 1.
  • Figure 5: The average ROUGE scores (R-1, R-2 and R-L) of the pretrained models with 0, 10 and 100 training examples, with variance. All few-shot results (10 and 100) are averaged over 5 random runs (with standard deviation; the same set of seeds is shared by all models). Note that DUC2004 has only 50 examples, so we use 20/10/20 for train/valid/test in the few-shot experiments.
  • ...and 3 more figures