Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention

Vasilii Korolkov, Andrey Yanchenko

TL;DR

This work treats intro and credit detection as a per-second binary sequence labeling problem and presents a visual-only, deep learning solution that leverages CLIP embeddings and multihead attention. By processing 60-frame sliding windows at 1 FPS and applying learnable positional encodings, the model captures temporal structure to distinguish intros/credits from main content. The approach achieves strong results (test accuracy ≈ 94%, precision ≈ 89%, recall ≈ 97%, F1 ≈ 91%) and outperforms heuristic and CNN-GRU baselines, while being optimized for real-time deployment via ONNX and FP16. Practical implications include automated content indexing, highlight generation, and video summarization, with future directions toward multimodal inputs and broader video types. Overall, the study demonstrates that transformer-based temporal modeling with CLIP features yields robust, scalable intro/credit detection suitable for streaming platforms and large video archives.
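As a rough illustration of the windowing step, the sketch below splits a video sampled at 1 FPS into 60-frame windows. Only the window length of 60 comes from the text above. The stride of 30, the handling of videos shorter than one window, and the helper name `sliding_windows` are assumptions made for this example, not details confirmed by the paper.

```python
from typing import List, Tuple


def sliding_windows(n_seconds: int, window: int = 60, stride: int = 30) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame-index ranges covering a video.

    n_seconds is the number of frames sampled at 1 FPS.
    The stride and the short-video handling are assumptions for this sketch.
    """
    if n_seconds <= 0:
        return []
    if n_seconds <= window:
        # Video shorter than one window: return a single, shorter range.
        return [(0, n_seconds)]
    starts = list(range(0, n_seconds - window + 1, stride))
    # Add a final window aligned to the end so the last seconds are covered.
    if starts[-1] + window < n_seconds:
        starts.append(n_seconds - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    for start, end in sliding_windows(150):
        print(f"window covers seconds {start}..{end - 1}")
```

With overlapping windows, each second can receive more than one prediction. How those predictions would be merged (for example by averaging) is left open here, since the text above does not specify it.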

Abstract

Detecting transitions between intro/credits and main content in videos is a crucial task for content segmentation, indexing, and recommendation systems. Manual annotation of such transitions is labor-intensive and error-prone, while heuristic-based methods often fail to generalize across diverse video styles. In this work, we introduce a deep learning-based approach that formulates the problem as a sequence-to-sequence classification task, where each second of a video is labeled as either "intro" or "film." Our method extracts frames at a fixed rate of 1 FPS, encodes them using CLIP (Contrastive Language-Image Pretraining), and processes the resulting feature representations with a multihead attention model incorporating learned positional encoding. The system achieves an F1-score of 91.0%, Precision of 89.0%, and Recall of 97.0% on the test set, and is optimized for real-time inference, achieving 11.5 FPS on CPU and 107 FPS on high-end GPUs. This approach has practical applications in automated content indexing, highlight detection, and video summarization. Future work will explore multimodal learning, incorporating audio features and subtitles to further enhance detection accuracy.
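Because the model emits one label per second, a downstream consumer would typically need contiguous time ranges rather than a flat label list. The following minimal sketch shows one way to do that conversion. The helper name `labels_to_segments` and the encoding (1 for "intro", 0 for "film") are hypothetical choices for this example and are not taken from the paper.

```python
from itertools import groupby
from typing import List, Tuple


def labels_to_segments(labels: List[int], target: int = 1) -> List[Tuple[int, int]]:
    """Collapse per-second labels into half-open (start, end) second ranges.

    Assumed encoding for this sketch: 1 = "intro" (or credits), 0 = "film".
    """
    segments = []
    position = 0
    for value, run in groupby(labels):
        length = sum(1 for _ in run)  # number of consecutive seconds with this label
        if value == target:
            segments.append((position, position + length))
        position += length
    return segments


if __name__ == "__main__":
    # Three seconds of intro, four of main content, two of credits.
    per_second = [1, 1, 1, 0, 0, 0, 0, 1, 1]
    print(labels_to_segments(per_second))
```

In practice, short spurious runs might be filtered out before reporting segments, but any such smoothing would be an extra design choice beyond what the abstract describes.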

Paper Structure

This paper contains 51 sections, 8 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Experimental results with alternative architectures and regularization strategies.
  • Figure 2: Performance metrics over training iterations. Each graph shows the progression of the respective metric on both training and validation sets.