A Modular Approach for Multimodal Summarization of TV Shows

Louis Mahon, Mirella Lapata

TL;DR

This work tackles long-form multimodal TV summarization by proposing a modular pipeline that processes scenes independently, reorders them to minimize context switching, translates visual content to text, summarizes dialogue, and fuses scene-level outputs into a final episode summary. It introduces PRISMA, an evaluation metric that decomposes summaries into atomic facts to assess precision and recall relative to gold summaries, addressing limitations of ROUGE for factuality. Empirically, the modular approach improves over end-to-end baselines on SummScreen3D and correlates more strongly with human judgments, with the transcript-derived content being the most impactful component. The approach offers a controllable, interpretable framework for long-form multimodal summarization and provides a practical, scalable metric to gauge factual quality, albeit with notable computational cost and ongoing gaps to human-level performance.

Abstract

In this paper we address the task of summarizing television shows, which touches key areas in AI research: complex reasoning, multiple modalities, and long narratives. We present a modular approach where separate components perform specialized sub-tasks which we argue affords greater flexibility compared to end-to-end methods. Our modules involve detecting scene boundaries, reordering scenes so as to minimize the number of cuts between different events, converting visual information to text, summarizing the dialogue in each scene, and fusing the scene summaries into a final summary for the entire episode. We also present a new metric, PRISMA (Precision and Recall EvaluatIon of Summary FActs), to measure both precision and recall of generated summaries, which we decompose into atomic facts. Tested on the recently released SummScreen3D dataset, our method produces higher quality summaries than comparison models, as measured with ROUGE and our new fact-based metric, and as assessed by human evaluators.
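
To make the idea of fact-level precision and recall concrete, the following is a minimal sketch and not the authors' implementation. It assumes the atomic facts have already been extracted, and it takes the fact-checking step as a caller-supplied function. The names `fact_precision_recall`, `is_supported`, and `toy_is_supported` are hypothetical, and the word-overlap checker is only a stand-in for whatever model actually judges whether a fact is supported.

```python
from typing import Callable, Sequence, Tuple


def fact_precision_recall(
    pred_facts: Sequence[str],
    gold_facts: Sequence[str],
    pred_summary: str,
    gold_summary: str,
    is_supported: Callable[[str, str], bool],
) -> Tuple[float, float]:
    """Fact-level precision and recall (illustrative sketch only).

    precision: share of facts from the generated summary that the gold
               summary supports.
    recall:    share of facts from the gold summary that the generated
               summary supports.
    """
    def share(facts: Sequence[str], reference: str) -> float:
        if not facts:
            return 0.0  # assumption: an empty fact list scores zero
        return sum(is_supported(f, reference) for f in facts) / len(facts)

    return share(pred_facts, gold_summary), share(gold_facts, pred_summary)


def toy_is_supported(fact: str, summary: str) -> bool:
    # Hypothetical stand-in checker: a fact counts as supported when every
    # one of its words occurs in the summary. A real system would need a
    # much stronger judge than word overlap.
    return set(fact.lower().split()) <= set(summary.lower().split())


gold_facts = ["brody kisses jessica", "natalie leaves town", "john arrives"]
pred_facts = ["brody kisses jessica", "marty steals car"]
gold_summary = " . ".join(gold_facts)
pred_summary = " . ".join(pred_facts)

precision, recall = fact_precision_recall(
    pred_facts, gold_facts, pred_summary, gold_summary, toy_is_supported
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Keeping the checker as a parameter separates the scoring arithmetic from the judgment of whether a fact is supported, so the same function works with any fact checker.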

Paper Structure

This paper contains 31 sections, 12 equations, 3 figures, 13 tables, 2 algorithms.

Figures (3)

  • Figure 1: Graphical depiction of our approach for long-form multimodal summarization where different subtasks are performed by five specialized modules (shown in different colors). We use simplified summaries for display and show only four scenes. The full episode (As the World Turns, aired 01-06-05) contains 29 scenes.
  • Figure 2: A selected keyframe from Scene 2 in One Life to Live (aired 10-18-10), which Kosmos-2 captions as "a man is kissing a woman". Our post-processing method to insert character names transforms this caption to "Brody is kissing Jessica".
  • Figure 3: Correlation between all pairs of metrics that we report in the experimental evaluation section. All are weakly correlated, with a stronger correlation between the different varieties of ROUGE.