Find the Cliffhanger: Multi-Modal Trailerness in Soap Operas

Carlo Bretti; Pascal Mettes; Hendrik Vincent Koops; Daan Odijk; Nanne van Noord

TL;DR

The paper addresses predicting trailerness in long-form soap operas to help editors create trailers. It introduces a multi-modal, multi-scale Trailerness Transformer that processes visual and textual signals at clip-level and shot-level scales, trained with editor-derived labels from the GTST dataset. The study reports that combining modalities and scales improves trailerness prediction, with a best F1 of around 9.2% on GTST, outperforming a random baseline, an MLP, and a frame-based summarization method (VASNet). By releasing the GTST dataset and code, the work offers a practical, open starting point for improving trailer generation in soap operas and other long-form content.

Abstract

Creating a trailer requires carefully picking out and piecing together brief, enticing moments from a longer video, making it a challenging and time-consuming task. Moments must be selected based on both visual and dialogue information. We introduce a multi-modal method for predicting trailerness to assist editors in selecting trailer-worthy moments from long-form videos. We present results on a newly introduced soap opera dataset, demonstrating that predicting trailerness is a challenging task that benefits from multi-modal information. Code is available at https://github.com/carlobretti/cliffhanger

Paper Structure (16 sections, 4 equations, 4 figures, 3 tables)

Introduction
Related Work
Video Summarization
Trailer Generation
Method
The Trailerness of Video
Trailer Labels from Editor Selections
Multi-scale and Multi-modal Trailerness Transformer
The GTST Dataset
Experiments
Setup
Evaluating Modalities and Temporal Scales
Combining Modalities and Temporal Scales
Comparisons to Baselines
Qualitative Results
...and 1 more sections

Figures (4)

Figure 1: Estimating trailerness in videos with our Trailerness Transformer. Given a video, such as a movie or TV series episode, we first compute clip-level and shot-level encodings for both the visual and textual modalities. We then train a transformer for each combination of modality and temporal scale, after which we aggregate the trailerness predictions of all transformers.
Figure 2: Baseline comparisons. An MLP-based architecture outperforms the random baseline and a frame-based summarization method (VASNet; Fajtl et al., 2019), and our model outperforms them all by incorporating sequential order and temporal positioning.
Figure 3: Qualitative results for visual and text streams at a clip level individually. Emotionally-charged visuals and urgent calls to action in text yield higher trailerness than transitory visuals and playful subtitles.
Figure 4: Qualitative results for our best-performing model. Scenes with bright visuals and emphatic dialogue yield higher trailerness than scenes with generic visuals and a lack of dialogue.
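
The aggregation step described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: the function names, the stream names, the shot-to-clip alignment, and the choice of a plain average as the aggregation rule are all assumptions made for illustration. The sketch takes per-stream trailerness scores (one stream per modality and temporal scale), brings shot-level scores onto the clip timeline, and averages the streams.

```python
from typing import Dict, List, Sequence, Tuple


def expand_shot_scores(
    shot_scores: Sequence[float],
    shot_spans: Sequence[Tuple[int, int]],
    num_clips: int,
) -> List[float]:
    """Copy each shot-level score onto the clips that the shot covers.

    shot_spans holds (start_clip, end_clip) pairs with an exclusive end.
    This alignment scheme is a hypothetical choice, not taken from the paper.
    """
    clip_scores = [0.0] * num_clips
    for score, (start, end) in zip(shot_scores, shot_spans):
        for i in range(start, min(end, num_clips)):
            clip_scores[i] = score
    return clip_scores


def aggregate_trailerness(streams: Dict[str, Sequence[float]]) -> List[float]:
    """Average per-clip scores across streams (assumed aggregation rule)."""
    lengths = {len(scores) for scores in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all streams must cover the same number of clips")
    num_clips = lengths.pop()
    return [
        sum(scores[i] for scores in streams.values()) / len(streams)
        for i in range(num_clips)
    ]


# Hypothetical scores for a three-clip video with two shots.
visual_shot = expand_shot_scores([0.5, 1.0], [(0, 2), (2, 3)], num_clips=3)
combined = aggregate_trailerness(
    {
        "visual_clip": [0.2, 0.8, 0.6],
        "text_clip": [0.4, 0.6, 0.2],
        "visual_shot": visual_shot,
    }
)
```

Averaging treats every stream as equally reliable; a weighted combination would be an equally plausible alternative under the same caption.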
