WWW: Where, Which and Whatever Enhancing Interpretability in Multimodal Deepfake Detection

Juho Jung, Sangyoun Lee, Jooeon Kang, Yunjin Na

TL;DR

The paper tackles the gap in multimodal deepfake detection where current benchmarks rely on whole-video labels, masking localized, frame-level manipulations. It introduces FakeMix, a clip-level benchmark that randomizes one-second video/audio clips to reveal manipulated segments, and two metrics, Temporal Accuracy ($TA$) and Frame-wise Discrimination Metric ($FDM$), for granular evaluation. The TA and FDM definitions are $TA = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{F_{v_i}} \sum_{j=1}^{F_{v_i}} I(\hat{y}_{ij}=y_{ij}) \right)$ and $FDM = \frac{\sum_{i=1}^{N} \sum_{j=1}^{F_{v_i}} I(\hat{y}_{ij}=y_{ij})}{\sum_{i=1}^{N} F_{v_i}}$, enabling frame-level assessment of robustness. Empirical results show a large drop from video-level AP (≈94.2%) to clip-level TA/FDM (≈53.1% and 52.1%), indicating the need for segment-level evaluation to improve interpretability and reliability in deepfake detection. Overall, FakeMix advances interpretability by enabling precise localization of where, which modality, and which generation technique was used, guiding more robust, granular detection strategies in real-world multimodal contexts.
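The two definitions above differ only in how they average, which a small sketch can make concrete. The code below is an illustration under stated assumptions, not the authors' implementation: it reads $N$ as the number of videos, $F_{v_i}$ as the number of evaluated frames (or clips) in video $v_i$, and $I(\cdot)$ as the indicator function; the function names `temporal_accuracy` and `frame_wise_discrimination`, and the 0/1 label encoding, are hypothetical choices made for the example.

```python
from typing import Sequence


def temporal_accuracy(preds: Sequence[Sequence[int]],
                      labels: Sequence[Sequence[int]]) -> float:
    """TA: compute accuracy inside each video, then average over videos.

    Assumption: preds[i][j] and labels[i][j] are the predicted and true
    labels for unit j of video i (e.g. 1 = fake, 0 = real).
    """
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("need the same non-zero number of videos")
    per_video = []
    for p, y in zip(preds, labels):
        if len(p) != len(y) or len(y) == 0:
            raise ValueError("each video needs matching, non-empty labels")
        correct = sum(int(a == b) for a, b in zip(p, y))
        per_video.append(correct / len(y))
    return sum(per_video) / len(per_video)


def frame_wise_discrimination(preds: Sequence[Sequence[int]],
                              labels: Sequence[Sequence[int]]) -> float:
    """FDM: pool every unit from every video, then take one accuracy."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("need the same non-zero number of videos")
    correct = 0
    total = 0
    for p, y in zip(preds, labels):
        if len(p) != len(y):
            raise ValueError("each video needs matching labels")
        correct += sum(int(a == b) for a, b in zip(p, y))
        total += len(y)
    if total == 0:
        raise ValueError("no frames to evaluate")
    return correct / total


if __name__ == "__main__":
    # Two toy videos of different length (made-up labels, not paper data).
    toy_preds = [[1, 1, 0, 0], [0, 1]]
    toy_labels = [[1, 1, 0, 1], [1, 1]]
    print(temporal_accuracy(toy_preds, toy_labels))          # (3/4 + 1/2) / 2
    print(frame_wise_discrimination(toy_preds, toy_labels))  # (3 + 1) / (4 + 2)
```

Under this reading, TA weights every video equally regardless of length, while FDM weights every frame equally, so longer videos contribute more to FDM. The two values coincide when all videos have the same number of evaluated units.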

Abstract

All current benchmarks for multimodal deepfake detection manipulate entire frames using various generation techniques, resulting in oversaturated detection accuracies exceeding 94% in video-level classification. However, these benchmarks struggle to detect dynamic deepfake attacks with the challenging frame-by-frame alterations present in real-world scenarios. To address this limitation, we introduce FakeMix, a novel clip-level evaluation benchmark aimed at identifying manipulated segments within both video and audio, providing insight into the origins of deepfakes. Furthermore, we propose novel evaluation metrics, Temporal Accuracy (TA) and Frame-wise Discrimination Metric (FDM), to assess the robustness of deepfake detection models. Evaluating state-of-the-art models against diverse deepfake benchmarks, particularly FakeMix, comprehensively demonstrates the effectiveness of our approach. Specifically, while existing models achieve an Average Precision (AP) of 94.2% at the video level, evaluating them at the clip level with the proposed metrics, TA and FDM, yields sharp declines in accuracy to 53.1% and 52.1%, respectively.

Paper Structure (12 sections, 2 equations, 2 figures, 2 tables)

This paper contains 12 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Comparison between a previous benchmark and the proposed benchmark, FakeMix. While FakeAVCeleb applies deepfake manipulation to complete video or audio segments for video-level classification, FakeMix introduces dynamic frame-level alterations to enhance the evaluation of deepfake video detection.
  • Figure 2: Comprehensive overview of the Temporal Accuracy and Frame-wise Discrimination Metric evaluations conducted on FakeMix.