Semantics-Oriented Multitask Learning for DeepFake Detection: A Joint Embedding Approach

Mian Zou, Baosheng Yu, Yibing Zhan, Siwei Lyu, Kede Ma

TL;DR

DeepFake detectors often rely on manipulation-specific cues, which limits their generalization. This work proposes the Semantics-Oriented Joint Embedding DeepFake Detector (SJEDD), which combines an automated, semantics-driven dataset expansion with a joint vision-language embedding to model the relationships between global face attributes and local face regions. Bi-level optimization sets the loss weights of the individual tasks automatically, so no manual tuning is needed during training. Experiments on six DeepFake datasets show improved generalization, along with human-understandable explanations of the predictions.

Abstract

In recent years, the multimedia forensics and security community has seen remarkable progress in multitask learning for DeepFake (i.e., face forgery) detection. The prevailing approach has been to frame DeepFake detection as a binary classification problem augmented by manipulation-oriented auxiliary tasks. This scheme focuses on learning features specific to face manipulations with limited generalizability. In this paper, we delve deeper into semantics-oriented multitask learning for DeepFake detection, capturing the relationships among face semantics via joint embedding. We first propose an automated dataset expansion technique that broadens current face forgery datasets to support semantics-oriented DeepFake detection tasks at both the global face attribute and local face region levels. Furthermore, we resort to the joint embedding of face images and labels (depicted by text descriptions) for prediction. This approach eliminates the need for manually setting task-agnostic and task-specific parameters, which is typically required when predicting multiple labels directly from images. In addition, we employ bi-level optimization to dynamically balance the fidelity loss weightings of various tasks, making the training process fully automated. Extensive experiments on six DeepFake datasets show that our method improves the generalizability of DeepFake detection and renders some degree of model interpretation by providing human-understandable explanations.
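
To make the joint-embedding idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each label is represented by a pair of text embeddings (a positive and a negative description), that a prediction is obtained from a temperature-scaled softmax over cosine similarities, and that the fidelity loss takes the commonly used binary form $1 - \sqrt{pq} - \sqrt{(1-p)(1-q)}$. The function names, the temperature value, and the fixed task weights are all illustrative assumptions; in the paper the weights are balanced by bi-level optimization rather than fixed.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_probability(image_emb, text_emb_pos, text_emb_neg, temperature=0.07):
    """Probability that the image matches the positive text description.

    Assumption: a two-way softmax over temperature-scaled cosine
    similarities; the temperature value is illustrative only.
    """
    s_pos = cosine_similarity(image_emb, text_emb_pos) / temperature
    s_neg = cosine_similarity(image_emb, text_emb_neg) / temperature
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


def fidelity_loss(p_true, p_pred):
    """Binary fidelity loss: 1 - sqrt(p*q) - sqrt((1-p)*(1-q)).

    Assumption: this is the standard binary form; the abstract names the
    loss but does not define it. The result is clamped at 0 to absorb
    floating-point round-off.
    """
    value = (1.0
             - math.sqrt(p_true * p_pred)
             - math.sqrt((1.0 - p_true) * (1.0 - p_pred)))
    return max(value, 0.0)


def weighted_multitask_loss(task_losses, task_weights):
    """Weighted sum of per-task losses (weights are fixed in this sketch)."""
    return sum(w * l for w, l in zip(task_weights, task_losses))
```

Because every task is scored against text embeddings in the same space, adding a task in this sketch only requires a new pair of descriptions, not a new prediction head.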

Paper Structure

This paper contains 28 sections, 8 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Illustration of the label hierarchy in our expanded FF++ dataset [rossler2019faceforensics]. The hierarchy organizes manipulation types into three nodes: identity (face-swapping via Deepfakes [faceswap] and FaceSwap [faceswap_Kowalski]), expression (mouth editing via Face2Face [thies2016face2face] and NeuralTextures [thies2019deferred]), and physical_inconsistency (local face editing via data augmentations). These nodes connect to six leaf nodes representing six distinct local face regions. Edges denote manipulation relationships: identity affects all regions, expression targets lip and mouth, while physical_inconsistency modifies eye, lip, mouth, and nose.
  • Figure 2: Pipeline for local face region manipulation. To select the source image, we search for the candidate whose $68$ detected landmarks [li2020face] have the smallest Euclidean distance to those of the target face image, excluding candidates that share the target's identity.
  • Figure 3: System diagram of SJEDD.
  • Figure 4: Illustration of semantic relationships among tasks before and after semantics-oriented multitask learning. The relationships are shown using a correlation matrix, in which each entry represents the cosine similarity between two task-specific textual embeddings. Zoom in for improved visibility.
  • Figure 5: Training dynamics induced by the cross-entropy and fidelity losses.
  • ...and 2 more figures
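
The source-image search described in the Figure 2 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' pipeline: the record layout (dictionaries with hypothetical `"identity"` and `"landmarks"` keys) is invented for the example, and the distance is taken as the Euclidean norm over all stacked landmark coordinates, which the caption does not spell out.

```python
import math


def landmark_distance(landmarks_a, landmarks_b):
    """Euclidean distance between two landmark sets of equal length.

    Each set is a sequence of (x, y) points; the coordinates are treated
    as one stacked vector (an assumption about the caption's metric).
    """
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2
                  for (xa, ya), (xb, yb) in zip(landmarks_a, landmarks_b))
    return math.sqrt(squared)


def find_best_source(target, candidates):
    """Return the candidate closest to the target in landmark space.

    Candidates sharing the target's identity are skipped, as in the
    caption. Returns None when no eligible candidate exists.
    """
    best, best_dist = None, math.inf
    for cand in candidates:
        if cand["identity"] == target["identity"]:
            continue  # exclude faces of the same person
        dist = landmark_distance(target["landmarks"], cand["landmarks"])
        if dist < best_dist:
            best, best_dist = cand, dist
    return best
```

A linear scan is enough for a sketch; for large candidate pools, a spatial index over the flattened landmark vectors would be the natural replacement.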