"Previously on ..." From Recaps to Story Summarization

Aditya Kumar Singh, Dhruv Srivastava, Makarand Tapaswi

TL;DR

The paper tackles long-form multimodal storytelling by leveraging TV episode recaps to supervise extractive video-text summarization. It introduces PlotSnap, a dataset built from two crime-thriller series, and TaleSumm, a two-level hierarchical Transformer that first builds shot- and dialog-level representations and then models episode-scale interactions within local story groups to predict per-shot and per-dialogue importance. The approach yields state-of-the-art results on PlotSnap and competitive performance on classic video summarization benchmarks, with strong cross-season and cross-series generalization. By using recap-based supervision, the work demonstrates a scalable pathway to multimodal story summarization for long videos, with practical implications for viewing aids and content analysis.

Abstract

We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps and long episodes of 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding sub-stories in the episode. We propose a hierarchical model TaleSumm that processes entire episodes by creating compact shot and dialog representations, and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization, our method extracts multiple plot points from long videos. We present a thorough evaluation on story summarization, including promising cross-series generalization. TaleSumm also shows good results on classic video summarization benchmarks.
Paper Structure (83 sections, 10 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: We illustrate how TV show recaps can be used to generate labels for multimodal story summarization. The top half features the recap shown at the beginning of the episode S08E23 based on key moments (shots and dialogs) from S08E22 of the TV series 24. As recaps help viewers recall essential story events, we extend these aligned segments to create summarization labels (visualized in the bottom half where the actual shots and dialogs inherited from recap are marked in deep red). For example, in the sub-story (left), the recap hints at Jack Bauer relaying classified information to the press, while the summary presents the complete sub-story, including Logan informing President Taylor about their failure to catch Jack and their disagreement over muzzling the press.
  • Figure 2: (A) TaleSumm ingests all video shots and dialogs of the episode and encodes them using (B) and (C). Based on temporal order, we combine tokens into local story groups (the illustration shows small groups of 2 shots and 0-2 utterances). To each group, we append a group token and add multiple embeddings before feeding them to the episode-level Transformer ($\mathsf{ET}$). For each shot or dialog token, a linear classifier predicts its importance. (B) Video shot encoder. For each frame, representations from multiple backbones are fused using attention ($\boxplus$). We feed these to a shot Transformer encoder $\mathsf{ST}$ and tap a shot-level representation from the $\mathsf{CLS}$ token. (C) The utterance encoder uses a fine-tuned language model and average-pooling across all words of the utterance. (D) The self-attention mask illustrates the block-diagonal self-attention structure across the episode. Group tokens across the episode (purple squares) communicate with each other. (E) Multiple embeddings are added to the tokens to capture modality type, time, and membership in a local story group.
  • Figure 3: TaleSumm predictions on S06E22 of 24 (test set). The "Ours" filled plot illustrates the importance score profile over time, where orange patches indicate story segments selected for summarization. Annotations are shown below: ground-truth (GT), fandom (F), and human-annotated (H). The story: Amid the high-stakes sequence depicted in the selected groups 1-3, Zhou Yong's team captures Josh Bauer, leading to a firefight with Jack Bauer, who seeks Josh's location. Negotiations with Phillip Bauer over Josh's return for a vital circuit board escalate global tensions between Russia and the USA. Simultaneously, Mike Doyle defies Jack's wishes and departs with Josh by helicopter (segment 7). In parallel, Lisa, backed by Tom Lennox, confronts a Russian agent, leading to her injury (4, 6). Morris attempts to console Nadia for Milo's loss at CTU (5). Escalating global tensions and the imminent showdown mark the episode.
  • Figure 4: Results of retrieving recap frames from episode frames with DenseNet (top) vs. ResNet (bottom). Qualitatively, DenseNet matches the correct frames from the episode more often.
  • Figure 5: Flowchart for identifying shots from the episode that appear in a recap and can be used as weak labels for story summarization. The process involves identifying the list of high-scoring matching frames, indexing the shots, and then preventing spurious matches by looking for high-scoring matches within a bounded duration. The flowchart presents an example of the process used to identify the set of shots $\mathcal{N}_s$ from the episode that match the recap shot.
  • ...and 6 more figures
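
The block-diagonal attention structure described in the Figure 2 caption (panel D) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a hypothetical token layout in which each local story group contributes its shot and dialog tokens followed by one group token, that tokens attend only within their own group, and that group tokens additionally attend to every other group token. The function name `build_group_attention_mask` and the `group_sizes` parameter are illustrative.

```python
from typing import List


def build_group_attention_mask(group_sizes: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True if token i may attend to token j.

    Assumed (hypothetical) layout: for each local story group, its shot and
    dialog tokens come first, followed by a single group token.
    """
    group_id: List[int] = []         # group index of every token position
    is_group_token: List[bool] = []  # marks the appended group tokens
    for g, size in enumerate(group_sizes):
        group_id.extend([g] * (size + 1))  # content tokens + 1 group token
        is_group_token.extend([False] * size + [True])

    n = len(group_id)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_group = group_id[i] == group_id[j]  # block-diagonal part
            both_group_tokens = is_group_token[i] and is_group_token[j]  # cross-group links
            mask[i][j] = same_group or both_group_tokens
    return mask


def render(mask: List[List[bool]]) -> List[str]:
    """Render the mask as text rows: '#' = attention allowed, '.' = blocked."""
    return ["".join("#" if allowed else "." for allowed in row) for row in mask]


if __name__ == "__main__":
    # Two groups with 2 and 3 content tokens (shots and utterances) respectively.
    for row in render(build_group_attention_mask([2, 3])):
        print(row)
```

Under these assumptions, information crosses group boundaries only through the group tokens, so the number of attended token pairs grows far more slowly than with full self-attention over an entire episode.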