Latent Spatiotemporal Adaptation for Generalized Face Forgery Video Detection

Daichi Zhang, Zihao Xiao, Jianmin Li, Shiming Ge

TL;DR

Different forgery videos are found to have distinct spatiotemporal patterns, which may be the key to generalization. A Latent Spatiotemporal Adaptation (LAST) approach is proposed to exploit these patterns for generalized face forgery video detection.

Abstract

Face forgery videos have caused severe public concern, and many detectors have been proposed. However, most of these detectors generalize poorly to videos from unknown distributions, such as those produced by unseen forgery methods. In this paper, we find that different forgery videos have distinct spatiotemporal patterns, which may be the key to generalization. To leverage this finding, we propose a Latent Spatiotemporal Adaptation (LAST) approach to facilitate generalized face forgery video detection. The key idea is to adapt the detector to the spatiotemporal patterns of unknown videos in latent space in order to improve generalization. Specifically, we first model the spatiotemporal patterns of face videos by using a lightweight CNN to extract local spatial features from each frame and then cascading a vision transformer to learn long-term spatiotemporal representations in latent space, which should contain more clues than pixel space. Then, by optimizing a transferable linear head to perform the usual forgery detection task on known videos and to recover the spatiotemporal clues of unknown target videos in a semi-supervised manner, our detector can flexibly adapt to the spatiotemporal patterns of unknown videos, leading to improved generalization. Additionally, to eliminate the influence of specific forgery videos, we pre-train our CNN and transformer only on real videos with two simple yet effective self-supervised tasks, reconstruction and contrastive learning in latent space, and keep them frozen during fine-tuning. Extensive experiments on public datasets demonstrate that our approach achieves state-of-the-art performance compared with other methods, with strong generalization and robustness.
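
The architecture outlined in the abstract (a per-frame CNN, a transformer over the resulting latent tokens, and a linear head trained on top of a frozen backbone) can be sketched roughly as follows. This is a minimal PyTorch illustration rather than the authors' implementation: the class names, layer sizes, use of one token per frame, and mean pooling are all assumptions, and the self-supervised pre-training and the semi-supervised adaptation objective are not shown.

```python
import torch
import torch.nn as nn


class FrameCNN(nn.Module):
    """Small CNN mapping each frame to one feature vector (layer sizes are assumptions)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool to 1x1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, H, W) -> (N, dim)
        return self.net(x).flatten(1)


class VideoDetector(nn.Module):
    """Hypothetical detector: CNN per frame, transformer across frames, linear head."""

    def __init__(self, dim: int = 64, num_frames: int = 8, num_classes: int = 2):
        super().__init__()
        self.cnn = FrameCNN(dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Learned positional embedding, one vector per frame position (assumption).
        self.pos = nn.Parameter(torch.zeros(1, num_frames, dim))
        self.head = nn.Linear(dim, num_classes)

    def freeze_backbone(self) -> None:
        """Freeze CNN and transformer so that only the linear head is trained."""
        for module in (self.cnn, self.transformer):
            for p in module.parameters():
                p.requires_grad = False
        self.pos.requires_grad = False

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                      # (B*T, 3, H, W)
        tokens = self.cnn(frames).reshape(b, t, -1)       # (B, T, dim)
        latent = self.transformer(tokens + self.pos[:, :t])
        return self.head(latent.mean(dim=1))              # (B, num_classes)


if __name__ == "__main__":
    model = VideoDetector()
    model.freeze_backbone()
    clip = torch.randn(2, 8, 3, 32, 32)  # two random 8-frame clips
    print(model(clip).shape)
```

In this sketch, calling `freeze_backbone()` leaves only the head's weight and bias trainable, which mirrors the abstract's statement that the CNN and transformer are kept frozen during fine-tuning.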

Paper Structure

This paper contains 19 sections, 8 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: The distinct spatiotemporal patterns of real and different forgery videos (a) may cause the distribution gap in latent space when training a naive detector (b), leading to limited generalization.
  • Figure 2: The whole pipeline of our proposed method, including the latent spatiotemporal adaptation stage (top) and the common spatiotemporal initialization stage (bottom).
  • Figure 3: Effect of the common spatiotemporal initialization under both intra- and cross-dataset settings.
  • Figure 4: The t-SNE visualization of the spatiotemporal representation under both intra- (top left) and cross-dataset (other three) settings.
  • Figure 5: Grad-CAM (Selvaraju et al., ICCV 2017) results under both intra- and cross-dataset settings. We find that our proposed method can effectively respond to the forgery traces.