Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model

Elaheh Baharlouei, Mahsa Shafaei, Yigeng Zhang, Hugo Jair Escalante, Thamar Solorio

TL;DR

This work targets the challenging problem of detecting comic mischief in videos by leveraging a novel three-modal dataset (video, text, audio) and a dedicated end-to-end model, HICCAP, that employs hierarchical cross-attention to fuse modalities. The approach combines strong feature encoding, caption-based subtitle completion, hybrid multimodal pretraining (VTM, VAM, ATM) with contrastive learning, and fine-tuning for both binary detection and multi-task subtype classification. Through extensive ablations and comparisons, the authors demonstrate significant performance gains over baselines and several state-of-the-art methods, including a new best AP on XD-Violence (92.17%), and competitive results on UCF101/HMDB51 with fewer parameters. The dataset and method collectively advance multimodal humor-aware content analysis with practical implications for content filtering and safety in online media.

Abstract

We address the challenge of detecting questionable content in online media, specifically the subcategory of comic mischief. This type of content combines elements such as violence, adult content, or sarcasm with humor, making it difficult to detect. A multimodal approach is vital for capturing the subtle details inherent in comic mischief content. To tackle this problem, we propose a novel end-to-end multimodal system for comic mischief detection. As part of this contribution, we release a novel dataset for the task consisting of three modalities: video, text (video captions and subtitles), and audio. We also design a HIerarchical Cross-attention model with CAPtions (HICCAP) to capture the intricate relationships among these modalities. The results show that the proposed approach significantly improves over robust baselines and state-of-the-art models on comic mischief detection and its type classification. This highlights the potential of our system to empower users to make informed decisions about the online content they choose to see. In addition, we conduct experiments on the UCF101, HMDB51, and XD-Violence datasets, comparing our model against other state-of-the-art approaches and showing that it performs strongly across these scenarios.

Paper Structure

This paper contains 28 sections, 7 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comic mischief examples in movies.
  • Figure 2: User interface for labeling dataset videos.
  • Figure 3: Distribution of comic mischief categories.
  • Figure 4: (a) The general architecture of HICCAP, which consists of four components: 1) Feature-Encoding, 2) Hierarchical-Cross-Attention mechanisms, 3) Pretraining, and 4) Binary and Multi-Task Prediction. (b) The structure of the Hierarchical Cross-Attention (HCA) module.