
Select and Summarize: Scene Saliency for Movie Script Summarization

Rohit Saxena, Frank Keller

TL;DR

The paper addresses long-form movie script summarization under memory constraints. It introduces scene saliency, where a scene counts as salient if it is mentioned in the summary, and the MENSA dataset of 100 movies with human-aligned scene-to-summary annotations. It proposes a two-stage Select & Summ approach: (1) predict salient scenes with a supervised scene-saliency classifier trained on silver-standard labels, and (2) generate an abstractive summary from only the salient scenes with a Longformer-based encoder-decoder. The method achieves state-of-the-art results on ScriptBase as measured by ROUGE and BERTScore, and improves QA-based factual coverage; zero-shot experiments on SummScreen-FD indicate competitive performance with fewer parameters. Overall, the work shows that explicit content selection via salient scenes can substantially reduce input size while improving the quality and factual consistency of movie-script summaries, which matters for scalable long-form summarization.

Abstract

Abstractive summarization for long-form narrative texts such as movie scripts is challenging due to the computational and memory constraints of current language models. A movie script typically comprises a large number of scenes; however, only a fraction of these scenes are salient, i.e., important for understanding the overall narrative. The salience of a scene can be operationalized by considering it as salient if it is mentioned in the summary. Automatically identifying salient scenes is difficult due to the lack of suitable datasets. In this work, we introduce a scene saliency dataset that consists of human-annotated salient scenes for 100 movies. We propose a two-stage abstractive summarization approach which first identifies the salient scenes in the script and then generates a summary using only those scenes. Using QA-based evaluation, we show that our model outperforms previous state-of-the-art summarization methods and reflects the information content of a movie more accurately than a model that takes the whole movie script as input.
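
The two-stage idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 0.5 threshold, and the `score_fn` and `summarize_fn` callables are hypothetical stand-ins for a trained scene-saliency classifier and an abstractive summarizer, and joining scenes with blank lines is an assumption made only for illustration.

```python
from typing import Callable, List, Sequence


def select_salient_scenes(
    scenes: Sequence[str],
    scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep scenes whose saliency score reaches the threshold, in script order."""
    if len(scenes) != len(scores):
        raise ValueError("expected exactly one score per scene")
    return [scene for scene, score in zip(scenes, scores) if score >= threshold]


def select_and_summarize(
    scenes: Sequence[str],
    score_fn: Callable[[Sequence[str]], Sequence[float]],
    summarize_fn: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Stage 1: score and filter scenes. Stage 2: summarize only the kept scenes."""
    scores = score_fn(scenes)  # stand-in for a saliency classifier
    salient = select_salient_scenes(scenes, scores, threshold)
    reduced_input = "\n\n".join(salient)  # shorter input than the full script
    return summarize_fn(reduced_input)  # stand-in for an abstractive summarizer


if __name__ == "__main__":
    toy_scenes = ["INT. LAB - NIGHT ...", "EXT. STREET - DAY ...", "INT. COURT - DAY ..."]
    toy_scores = lambda scenes: [0.9, 0.1, 0.6]   # dummy scores for the demo
    toy_summarizer = lambda text: text[:40]       # dummy "summarizer" for the demo
    print(select_and_summarize(toy_scenes, toy_scores, toy_summarizer))
```

The point of the sketch is the data flow: the summarizer only ever sees the filtered scenes, so its input length depends on how many scenes are selected rather than on the length of the full script.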

Paper Structure

This paper contains 27 sections, 5 equations, 2 figures, and 10 tables.

Figures (2)

  • Figure 1: The architecture of the scene saliency detection and summarization models. The models are trained in a pipeline where salient scene detection is trained separately.
  • Figure 2: Distribution of movie length from the training set for full text and only the salient scenes.