MovieSum: An Abstractive Summarization Dataset for Movie Screenplays

Rohit Saxena, Frank Keller

TL;DR

The paper introduces MovieSum, a dataset of 2200 manually formatted English movie screenplays paired with their Wikipedia plot summaries, intended to advance abstractive summarization of long, narrative-rich scripts. The dataset includes metadata with IMDb IDs, covers a broad range of genres and release years, and comes with a dataset analysis showing that the summaries are highly abstractive. Experiments with zero-shot and fine-tuned long-context models (including LED, LongT5, Pegasus-X, Vicuna, and FLAN-UL2) show that the dataset is difficult and expose the current limits of long-input summarization, while also showing that longer input contexts and content-aware modeling help. The authors point to screenplay structure and external knowledge integration as key directions for future research, and position MovieSum as a benchmark for narrative understanding.

Abstract

Movie screenplay summarization is challenging, as it requires an understanding of long input contexts and various elements unique to movies. Large language models have shown significant advancements in document summarization, but they often struggle with processing long input contexts. Furthermore, while television transcripts have received attention in recent studies, movie screenplay summarization remains underexplored. To stimulate research in this area, we present a new dataset, MovieSum, for abstractive summarization of movie screenplays. This dataset comprises 2200 movie screenplays accompanied by their Wikipedia plot summaries. We manually formatted the movie screenplays to represent their structural elements. Compared to existing datasets, MovieSum possesses several distinctive features: (1) It includes movie screenplays, which are longer than scripts of TV episodes. (2) It is twice the size of previous movie screenplay datasets. (3) It provides metadata with IMDb IDs to facilitate access to additional external knowledge. We also show the results of recently released large language models applied to summarization on our dataset to provide a detailed baseline.

Paper Structure (19 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Distribution of movie genres and release years in the dataset.
  • Figure 2: Coverage-Density plot of the summaries.
  • Figure 3: Distribution of movie script length from the training set.
  • Figure 4: Distribution of summary length from the training set.
  • Figure 5: Example of cleanly formatted scenes from a movie screenplay.