Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion

Syed Hammad Ahmed, Muhammad Junaid Khan, Gita Sukthankar

TL;DR

This work addresses the need for robust moderation of children's video content by introducing a multimodal framework that fuses audio cues with video features using an adapted CLIP architecture. The approach freezes the backbone encoders while adding a pre-trained AudioCLIP-based audio encoder, a trainable audio projection, and learnable prompts in both the vision and text branches, with temporal pooling for video. A new MMOB dataset with labeled audio and video annotations enables supervised and few-shot evaluation, showing that incorporating audio and prompts improves performance over unimodal or non-prompt baselines. The method offers a computationally efficient path to effective content moderation with practical impact for platforms serving children, and it can be extended to other video formats and advertisement content.

Abstract

Due to the rise in video content creation targeted towards children, there is a need for robust content moderation schemes for video hosting platforms. A video that is visually benign may include audio content that is inappropriate for young children, and such content is impossible to detect with a unimodal content moderation system that relies on visual features alone. Popular video hosting platforms for children such as YouTube Kids still publish videos that contain audio content that is not conducive to a child's healthy behavioral and physical development. Robust classification of malicious videos therefore requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. To address this, we present an efficient adaptation of CLIP (Contrastive Language-Image Pre-training) that can leverage contextual audio cues for enhanced content moderation. We incorporate 1) the audio modality and 2) prompt learning, while keeping the backbone modules of each modality frozen. We conduct our experiments on a multimodal version of the MOB (Malicious or Benign) dataset in supervised and few-shot settings.

Paper Structure (23 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: An example of a malicious video currently available on the YouTube Kids platform, showing a furious cartoon character shooting at other cartoon characters with a machine gun.
  • Figure 2: A sample video that includes fast and loud piano notes. Such videos usually have a high tempo, a lack of rhythm, and varying pitches. The video also includes bright and striking hues, and suggests the violent action of "hitting the piano with a bat".
  • Figure 3: Our proposed architecture incorporates the audio modality by adding a pre-trained audio encoder. The inputs to the audio encoder are spectrograms, which are visual representations of audio frequency signals. The trainable projection layer learns audio representations for the downstream content moderation task. Temporal pooling outputs a combined representation of T input video frames, hence adapting vanilla CLIP for video. These audio and visual representations are fused together within the Feature Fusion block. Vanilla CLIP's text and vision branches are adapted to include learnable prompts (tokens) through all layers. We keep all encoder layers of the text, vision, and audio branches frozen. The input to the text branch is the class name, e.g. "malicious" as shown in the figure.
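
The data flow described in the Figure 3 caption can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: random vectors stand in for the outputs of the frozen encoders, and several details are assumptions made only for illustration, namely mean pooling over the T frames, a single linear layer as the trainable audio projection, fusion by averaging the normalized audio and video embeddings, the class names "malicious" and "benign" as the only text inputs, and a softmax temperature of 0.07. All function names are hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def temporal_pool(frame_embeddings):
    """ASSUMPTION: mean pooling. Combine T per-frame embeddings into one vector."""
    t = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / t for i in range(dim)]


def project(features, weight):
    """Linear projection; `weight` has shape (out_dim, in_dim).

    In the described architecture this is the trainable part of the audio branch.
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def fuse(video_emb, audio_emb):
    """ASSUMPTION: fuse by averaging the two unit-length embeddings."""
    v, a = l2_normalize(video_emb), l2_normalize(audio_emb)
    return l2_normalize([(x + y) / 2 for x, y in zip(v, a)])


def classify(fused, text_embeddings, temperature=0.07):
    """Softmax over cosine similarities between the fused and class embeddings."""
    names = list(text_embeddings)
    logits = [
        sum(x * y for x, y in zip(fused, l2_normalize(text_embeddings[name])))
        / temperature
        for name in names
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Toy stand-ins for the outputs of the frozen vision, audio, and text encoders.
rng = random.Random(0)
T, DIM, AUDIO_DIM = 8, 16, 32
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(T)]
audio_features = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
audio_proj_weight = [
    [rng.gauss(0, 0.1) for _ in range(AUDIO_DIM)] for _ in range(DIM)
]
text_embeddings = {
    "malicious": [rng.gauss(0, 1) for _ in range(DIM)],
    "benign": [rng.gauss(0, 1) for _ in range(DIM)],
}

video_emb = temporal_pool(frames)                     # (T, DIM) -> (DIM,)
audio_emb = project(audio_features, audio_proj_weight)  # (AUDIO_DIM,) -> (DIM,)
fused = fuse(video_emb, audio_emb)
probs = classify(fused, text_embeddings)
print(probs)
```

In this sketch only `audio_proj_weight` would be updated during training; the learnable prompt tokens that the caption describes for the text and vision branches are not modeled here.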