MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

Chenguang Zhu, Yang Liu, Jie Mei, Michael Zeng

TL;DR

MediaSum addresses the lack of large-scale public dialogue summarization data by aggregating NPR and CNN transcripts with abstractive summaries and topic labels. The authors introduce a topic-segmentation approach for CNN multi-topic interviews and report dataset-scale statistics and a positional bias analysis distinguishing TV and radio formats. They benchmark several models, showing that pre-trained architectures attain strong performance and that fine-tuning on MediaSum improves other dialogue summarization tasks via transfer learning. This dataset's scale and diversity make it a valuable resource for training and evaluating multi-party dialogue summarization systems.

Abstract

This paper introduces MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MediaSum can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.

Paper Structure

This paper contains 15 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: The frequency of the non-stop summary words appearing at different positions of the dialogue. The positions are normalized to [0, 100].
  • Figure 2: Distribution of the number of turns, speakers, dialogue words and summary words in the dialogues of MediaSum dataset.
  • Figure 3: The frequency of non-stop topic words appearing in different positions of the dialogue. The dialogues are from the original CNN transcripts with one topic. The positions are normalized to [0, 100].
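
The positional statistic described in the captions of Figures 1 and 3 can be illustrated with a minimal sketch. It is not the authors' code: the function name `position_histogram`, the tiny stop-word list, whitespace tokenization, and the choice of 101 integer bins are all assumptions made for illustration. The idea is to map each dialogue token to a normalized position in [0, 100] and count the tokens that are also non-stop summary (or topic) words.

```python
# Illustrative stop-word list (an assumption; the paper's list is not given here).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for"}


def position_histogram(dialogue_tokens, summary_tokens):
    """Count non-stop summary words at each normalized dialogue position.

    Returns a list of 101 counts, where index p corresponds to
    normalized position p in [0, 100].
    """
    # Non-stop words of the summary, lowercased for matching.
    summary_vocab = {t.lower() for t in summary_tokens} - STOP_WORDS
    hist = [0] * 101
    n = len(dialogue_tokens)
    for i, tok in enumerate(dialogue_tokens):
        if tok.lower() in summary_vocab:
            # First token maps to 0, last token maps to 100.
            pos = 0 if n == 1 else (100 * i) // (n - 1)
            hist[pos] += 1
    return hist


if __name__ == "__main__":
    dialogue = "the economy is growing fast said the host today".split()
    summary = "economy growing".split()
    hist = position_histogram(dialogue, summary)
    # Print only the positions where a summary word occurred.
    print([(p, c) for p, c in enumerate(hist) if c])
```

Summing such histograms over every dialogue in a corpus would give a curve of the kind the figures describe; a peak near position 0 would indicate that summary words concentrate at the start of the transcript.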