Beyond Real versus Fake Towards Intent-Aware Video Analysis

Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva

TL;DR

The paper argues that binary deepfake detection is insufficient and proposes intent recognition in videos as a complementary task. It introduces IntentHQ, a multimodal dataset of 5168 human-centric videos across 23 intent categories, and benchmarks multimodal baselines using video, audio, and text. A novel three-way self-supervised pretraining framework aligns the three modalities and yields a top-1 accuracy of 52.5% after fine-tuning, demonstrating the value of cross-modal representations for intent understanding. The work also analyzes modality contributions, the impact of language and audio quality, and category-specific performance, and outlines future directions for longer videos and more nuanced labeling.

Abstract

The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including "Financial fraud", "Indirect marketing", "Political propaganda", as well as "Fear mongering". We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.

Paper Structure

This paper contains 30 sections, 3 equations, 3 figures, 6 tables, 2 algorithms.

Figures (3)

  • Figure 1: Three-Way Contrastive Alignment Pipeline. Overview of the proposed training methodology. The augmented dataset is encoded using modality-specific encoders (CLIP for video, WavLM for audio, CLIP Text for text), projected into a shared space, and aligned through a three-way contrastive loss. The pretrained encoders are then fine-tuned using a supervised MLP classifier for intent prediction.
  • Figure 2: Samples from our IntentHQ dataset. IntentHQ contains 5 broad categories further divided into 23 total intent classes. The images above represent examples from each broad category.
  • Figure 3: Class Distribution within IntentHQ. Overview of the number of videos within each class of IntentHQ, denoting the grouping of the larger main categories as well as showing whether classes are considered benign or malicious.
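
The three-way contrastive alignment described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the three-way loss is the average of symmetric InfoNCE terms over the three modality pairs (video-audio, video-text, audio-text); the paper's exact formulation is not given in this summary and may differ. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical; in the actual pipeline the embeddings would come from the projected CLIP, WavLM, and CLIP Text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches; row i of `a` is the positive for row i of `b`."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # logits[i][j] = cosine similarity of a[i] and b[j], scaled by the temperature
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]  # transpose gives the b -> a direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


def three_way_contrastive_loss(video, audio, text, temperature=0.07):
    """Assumed form: average of the three pairwise symmetric InfoNCE losses."""
    return (info_nce(video, audio, temperature)
            + info_nce(video, text, temperature)
            + info_nce(audio, text, temperature)) / 3.0


# Toy projected embeddings for a batch of two clips (hypothetical values)
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # captions swapped between the two clips

loss_aligned = three_way_contrastive_loss(video, audio, text_aligned)
loss_shuffled = three_way_contrastive_loss(video, audio, text_shuffled)
print(loss_aligned < loss_shuffled)
```

In this sketch the loss is small when each clip's video, audio, and text embeddings point in the same direction and grows when one modality is mismatched, which is the property the pretraining stage relies on before the supervised MLP classifier is fine-tuned.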