Hierarchical Transformers for Multi-Document Summarization

Yang Liu, Mirella Lapata

TL;DR

This paper tackles abstractive multi-document summarization by introducing a hierarchical Transformer that encodes and relates multiple source paragraphs through local intra-paragraph and global inter-paragraph attention. A learning-based paragraph ranking stage selects informative inputs, while a graph-informed attention mechanism allows external lexical similarity or discourse graphs to guide cross-document reasoning. Evaluations on the WikiSum dataset show substantial improvements over strong baselines, with additional gains when using discourse graphs and longer input contexts. The approach demonstrates scalable modeling of cross-document structure and suggests promising directions for applying hierarchical Transformers to question answering and related inference tasks.

Abstract

In this paper, we develop a neural summarization model which can effectively process multiple input documents and distill abstractive summaries. Our model augments a previously proposed Transformer architecture with the ability to encode documents in a hierarchical manner. We represent cross-document relationships via an attention mechanism which allows information to be shared across paragraphs, as opposed to simply concatenating text spans and processing them as a flat sequence. Our model learns latent dependencies among textual units, but can also take advantage of explicit graph representations focusing on similarity or discourse relations. Empirical results on the WikiSum dataset demonstrate that the proposed architecture brings substantial improvements over several strong baselines.
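
To make the idea of sharing information across paragraphs concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes each paragraph has already been reduced to a single vector, omits all learned projections and multiple heads, and assumes that when an explicit graph is supplied its row-normalized edge weights are used in place of the computed attention weights. The function names `softmax` and `inter_paragraph_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def inter_paragraph_attention(para_vecs, graph=None):
    """Return one context vector per paragraph.

    para_vecs: list of equal-length float lists, one vector per paragraph.
    graph: optional square matrix of non-negative edge weights (assumption:
           e.g. lexical similarity or discourse links between paragraphs).
    """
    n = len(para_vecs)
    dim = len(para_vecs[0])
    contexts = []
    for i in range(n):
        if graph is None:
            # Latent dependencies: scaled dot-product scores between paragraphs.
            scores = [
                sum(a * b for a, b in zip(para_vecs[i], para_vecs[j])) / math.sqrt(dim)
                for j in range(n)
            ]
            weights = softmax(scores)
        else:
            # Explicit graph: row-normalized edge weights act as attention weights.
            row_sum = sum(graph[i])
            if row_sum > 0:
                weights = [g / row_sum for g in graph[i]]
            else:
                weights = [1.0 / n] * n  # isolated node: fall back to uniform
        # Context for paragraph i is the weighted sum of all paragraph vectors.
        contexts.append(
            [sum(weights[j] * para_vecs[j][d] for j in range(n)) for d in range(dim)]
        )
    return contexts


if __name__ == "__main__":
    vecs = [[1.0, 0.0], [0.0, 1.0]]
    print(inter_paragraph_attention(vecs))
    print(inter_paragraph_attention(vecs, graph=[[0, 1], [1, 0]]))
```

Unlike flat concatenation, each paragraph keeps its own representation and receives a summary of the others, so the cost of the cross-paragraph step grows with the number of paragraphs rather than with the total token count.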

Paper Structure

This paper contains 26 sections, 15 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Pipeline of our multi-document summarization system. $L$ source paragraphs are first ranked and the $L'$-best ones serve as input to an encoder-decoder model which generates the target summary.
  • Figure 2: A global transformer layer. Different colors indicate different heads in multi-head pooling and inter-paragraph attention.
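
The rank-then-select stage in Figure 1 can be sketched as follows. This is an illustration under stated assumptions, not the paper's ranker: the paper describes a learning-based ranking stage, whereas the stand-in `overlap_score` below is a simple token-overlap heuristic against a query string (assumed here to be the article title). The names `overlap_score` and `select_top_paragraphs` are hypothetical.

```python
def overlap_score(query, paragraph):
    """Stand-in relevance score: fraction of query tokens found in the paragraph.

    Assumption: a learned ranker would replace this heuristic in practice.
    """
    query_tokens = query.lower().split()
    para_tokens = set(paragraph.lower().split())
    if not query_tokens:
        return 0.0
    return sum(t in para_tokens for t in query_tokens) / len(query_tokens)


def select_top_paragraphs(query, paragraphs, l_prime, score_fn=overlap_score):
    """Rank the L source paragraphs and keep the L'-best, in ranked order."""
    ranked = sorted(paragraphs, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:l_prime]


if __name__ == "__main__":
    source = [
        "the cat sat",
        "hierarchical transformers for summarization",
        "transformers are models",
    ]
    # Only the selected paragraphs would be passed to the encoder-decoder model.
    print(select_top_paragraphs("hierarchical transformers", source, l_prime=2))
```

Because `score_fn` is a parameter, any scoring model with the same `(query, paragraph) -> float` shape can be swapped in without changing the selection logic.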