Automatic Funny Scene Extraction from Long-form Cinematic Videos

Sibendu Paul, Haotian Jiang, Caren Chen

TL;DR

This work introduces an end-to-end pipeline for automatically extracting humorous scenes from long-form cinematic content by integrating shot detection, multimodal scene localization, and humor tagging. Scene localization relies on a multimodal, shot-level representation learned through guided triplet mining, using an X-CLIP-based visual encoder paired with a DINO projection head and augmented by shot-level text captions. Humor tagging combines audio cues with long-text contextual reasoning via a ColBERT-based model, supplemented by a guardrail that filters improper humor and a heuristic scoring mechanism that ranks the top clips. Evaluated on the OVSD and MovieNet-SSeg benchmarks, the approach achieves an 18.3% AP improvement over state-of-the-art scene detection on OVSD and a 0.834 F1 score for humor detection in long text. Curator-based evaluations across five cinematic titles are also strong (87% of extracted clips intended to be funny, 98% of scenes accurately localized), and the pipeline generalizes to trailers, indicating deployment potential for snackable content generation in streaming workflows.

Abstract

Automatically extracting engaging and high-quality humorous scenes from cinematic titles is pivotal for creating captivating video previews and snackable content, boosting user engagement on streaming platforms. Long-form cinematic titles, with their extended duration and complex narratives, challenge scene localization, while humor's reliance on diverse modalities and its nuanced style add further complexity. This paper introduces an end-to-end system for automatically identifying and ranking humorous scenes from long-form cinematic titles, featuring shot detection, multimodal scene localization, and humor tagging optimized for cinematic content. Key innovations include a novel scene segmentation approach combining visual and textual cues, improved shot representations via guided triplet mining, and a multimodal humor tagging framework leveraging both audio and text. Our system achieves an 18.3% AP improvement over state-of-the-art scene detection on the OVSD dataset and an F1 score of 0.834 for detecting humor in long text. Extensive evaluations across five cinematic titles demonstrate that 87% of the clips extracted by our pipeline are intended to be funny, while 98% of scenes are accurately localized. With successful generalization to trailers, these results showcase the pipeline's potential to enhance content creation workflows, improve user engagement, and streamline snackable content generation for diverse cinematic media formats.
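
To make the idea of learning shot representations from triplets more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that ground-truth scene boundaries are available, that a positive is any other shot from the anchor's scene, and that a negative is any shot from a different scene. The function names (`scene_ids`, `mine_triplets`, `triplet_loss`), the random sampling, and the margin of 0.2 are all illustrative assumptions; the guidance strategy used in the paper's guided triplet mining is not reproduced here.

```python
import math
import random


def scene_ids(num_shots, boundaries):
    """Map each shot index to a scene id.

    `boundaries` lists the index of the first shot of every scene after
    the first one (a hypothetical encoding of ground-truth boundaries).
    """
    starts = set(boundaries)
    ids, scene = [], 0
    for i in range(num_shots):
        if i in starts and i != 0:
            scene += 1
        ids.append(scene)
    return ids


def mine_triplets(ids, rng):
    """Build (anchor, positive, negative) shot-index triplets.

    Positive: another shot in the anchor's scene.
    Negative: a shot from any other scene.
    Sampling is uniform here; this is an assumption, not the paper's rule.
    """
    triplets = []
    for a, sid in enumerate(ids):
        positives = [i for i, s in enumerate(ids) if s == sid and i != a]
        negatives = [i for i, s in enumerate(ids) if s != sid]
        if positives and negatives:
            triplets.append((a, rng.choice(positives), rng.choice(negatives)))
    return triplets


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on Euclidean distances."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy example: five shots, with a scene boundary before shot 3.
ids = scene_ids(5, [3])
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0)]
for a, p, n in mine_triplets(ids, random.Random(0)):
    loss = triplet_loss(embeddings[a], embeddings[p], embeddings[n])
    print(a, p, n, round(loss, 3))
```

In this toy setup the embeddings already separate the two scenes, so every triplet's loss is zero; during training, the loss is positive for triplets whose negative lies too close to the anchor, which pushes shots from different scenes apart.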

Paper Structure (36 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Funny scene extraction pipeline overview. The pipeline consists of three main blocks: shot detection, shot representation extraction, and scene merging, followed by downstream binary tagging and humor scoring.
  • Figure 2: Scene detection module overview. Triplets are generated from MovieNet-SSeg ground-truth boundaries, and spatial and temporal augmentations are applied during contrastive pretraining with a triplet loss. For each video, shot-level video captions are extracted and passed through a shot text encoder to obtain shot-level text features. Fine-tuning aggregates neighboring shot features and trains MLP layers via supervised learning.
  • Figure 3: Multi-modal humor tagging pipeline overview. (1) Extract audio and detect laughter, (2) transcribe audio and analyze funny conversations, (3) filter improper humor, then score and output the ranked funny scene videos.
  • Figure 4: Example extracted humorous scenes and instances of improper humor from various long-form cinematic titles. Observed laughter is marked with orange boxes, while laugh-worthy dialogue or humorous context is highlighted in violet.
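
The filter, score, and rank steps described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's scoring heuristic, which is not given in this section. The `Scene` fields, the equal weighting of audio and text evidence, the `min_score` threshold, and the function names `humor_score` and `rank_funny_scenes` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    scene_id: str
    laughter_seconds: float  # total detected laughter (audio branch)
    duration_seconds: float  # scene length
    text_humor_prob: float   # text-branch humor probability in [0, 1]
    improper: bool           # guardrail flag for improper humor


def humor_score(scene, audio_weight=0.5):
    """Blend audio and text evidence into one score (assumed weighting)."""
    if scene.duration_seconds > 0:
        laughter_ratio = min(1.0, scene.laughter_seconds / scene.duration_seconds)
    else:
        laughter_ratio = 0.0
    return audio_weight * laughter_ratio + (1 - audio_weight) * scene.text_humor_prob


def rank_funny_scenes(scenes, top_k=3, min_score=0.3):
    """Drop flagged scenes, score the rest, and return the top_k by score."""
    scored = [(humor_score(s), s.scene_id) for s in scenes if not s.improper]
    scored = [item for item in scored if item[0] >= min_score]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy example with made-up values.
scenes = [
    Scene("a", laughter_seconds=6, duration_seconds=60, text_humor_prob=0.9, improper=False),
    Scene("b", laughter_seconds=30, duration_seconds=60, text_humor_prob=0.8, improper=True),
    Scene("c", laughter_seconds=0, duration_seconds=60, text_humor_prob=0.2, improper=False),
    Scene("d", laughter_seconds=12, duration_seconds=60, text_humor_prob=0.6, improper=False),
]
for score, scene_id in rank_funny_scenes(scenes):
    print(scene_id, round(score, 2))
```

In the toy data, scene "b" is removed by the guardrail flag even though it would score highest, and scene "c" falls below the threshold, so only "a" and "d" are returned, in that order.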