DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection

Fan Nie, Jiangqun Ni, Jian Zhang, Bin Zhang, Weizhe Zhang

TL;DR

This paper proposes DIP (Diffusion Learning of Inconsistency Pattern), a transformer-based framework that exploits directional inconsistencies for deepfake video detection. It effectively identifies directional forgery clues and achieves state-of-the-art performance.

Abstract

With the advancement of deepfake generation techniques, deepfake detection has become increasingly important for protecting the integrity of multimedia content. Recently, temporal inconsistency clues have been explored to improve the generalizability of deepfake video detection. According to our observation, the temporal artifacts of forged videos, in terms of motion information, usually exhibit quite distinct inconsistency patterns along the horizontal and vertical directions, which can be leveraged to improve the generalizability of detectors. In this paper, a transformer-based framework for Diffusion Learning of Inconsistency Pattern (DIP) is proposed, which exploits directional inconsistencies for deepfake video detection. Specifically, DIP begins with a spatiotemporal encoder to represent spatiotemporal information. A directional inconsistency decoder is then adopted, in which direction-aware attention and inconsistency diffusion are incorporated to explore potential inconsistency patterns and jointly learn their inherent relationships. In addition, the SpatioTemporal Invariant Loss (STI Loss) is introduced to contrast spatiotemporally augmented sample pairs and prevent the model from overfitting to nonessential forgery artifacts. Extensive experiments on several public datasets demonstrate that our method can effectively identify directional forgery clues and achieve state-of-the-art performance.

Paper Structure

This paper contains 37 sections, 11 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Illustration of the temporal inconsistencies. For a pair of real and fake videos, the motion information in terms of optical flow is extracted and visualized with the TV-L1 algorithm (Wedel et al., 2009). Each optical flow frame is then sliced to obtain horizontal and vertical motion slices for the real and fake videos. Comparing the average temporal motion evolution of the real and fake videos reveals the inconsistency along both the horizontal and vertical directions.
  • Figure 2: Overview of the proposed DIP. The spatiotemporal encoder (STE) extracts spatiotemporal forgery features from the embedded sequences. The directional inconsistency decoder (DID) then models inconsistency patterns and fuses features, and MDC uses the classification tokens for the final prediction.
  • Figure 3: Illustration of a unit STE block. Spatial attention is used to extract spatial dependency for each frame, and temporal attention is applied to characterize the temporal dependency at a specific location across multiple frames.
  • Figure 4: Calculation of the motion similarity transition matrix from the horizontal and vertical token sequences $z_h$ and $z_v$. The $2L \times 2L$ transition matrix $P$ consists of four types of $L \times L$ submatrices, i.e., horizontal-horizontal transition (blue dotted, $P_{hh}$), horizontal-vertical transition (blue solid, $P_{hv}$), vertical-horizontal transition (purple solid, $P_{vh}$), and vertical-vertical transition (purple dotted, $P_{vv}$).
  • Figure 5: Overview of the proposed optimization framework.
  • ...and 5 more figures
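
The block layout of the transition matrix described in the Figure 4 caption can be illustrated with a minimal sketch. The caption only states that $P$ is $2L \times 2L$ and is built from motion similarity between the horizontal and vertical token sequences; the use of cosine similarity, the row-wise softmax normalization, and the function names `transition_matrix` and `split_blocks` are assumptions made here for illustration, not the paper's stated formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def softmax(row):
    """Numerically stable softmax, so each row becomes a probability distribution."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transition_matrix(z_h, z_v):
    """Build a 2L x 2L row-stochastic matrix from L horizontal and L vertical tokens.

    Assumption: pairwise cosine similarity followed by a row-wise softmax.
    """
    tokens = z_h + z_v  # horizontal tokens first, then vertical tokens
    sim = [[cosine(a, b) for b in tokens] for a in tokens]
    return [softmax(row) for row in sim]


def split_blocks(P, L):
    """Split P into the four L x L submatrices named in the caption."""
    P_hh = [row[:L] for row in P[:L]]  # horizontal -> horizontal
    P_hv = [row[L:] for row in P[:L]]  # horizontal -> vertical
    P_vh = [row[:L] for row in P[L:]]  # vertical -> horizontal
    P_vv = [row[L:] for row in P[L:]]  # vertical -> vertical
    return P_hh, P_hv, P_vh, P_vv
```

Calling `transition_matrix(z_h, z_v)` with two lists of $L$ token vectors each returns a matrix whose rows sum to one, and `split_blocks(P, L)` recovers $P_{hh}$, $P_{hv}$, $P_{vh}$, and $P_{vv}$ in that order. Stacking the horizontal tokens before the vertical ones is what places the same-direction blocks on the diagonal and the cross-direction blocks off the diagonal.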