
Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection

Kutub Uddin, Nusrat Tasnim, Byung Tae Oh

Abstract

Multimedia data, particularly images and videos, are integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, counterfeiters, whether amateur or skilled, can manipulate such content to create deepfakes, often for slanderous purposes. To address this challenge, several forensic methods have been developed to verify the authenticity of the content. The effectiveness of these methods depends on their focus, and challenges arise from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Building on these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation (HFR) that exploits coarse-to-fine information to improve deepfake detection. The proposed method extracts features from the frame, the face, and key facial regions (i.e., lips, eyes, and nose) separately to explore their coarse-to-fine relationships. This design enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets under intra-dataset, inter-dataset, and inter-manipulation settings. It achieves average AUCs of 98.42\% on FF++, 79.80\% on CDF1, 85.34\% on CDF2, 89.41\% on DFD, 84.07\% on DFDC, 95.62\% on DTIM, 80.76\% on PDD, and 100\% on WLDR. The results demonstrate that our approach generalizes effectively and outperforms existing methods.
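The two learning components named in the abstract, channel attention over regional features and a deep triplet objective, can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the squeeze-and-excitation-style gating, the bottleneck weight shapes, and the squared-distance triplet margin are assumptions chosen for illustration.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Hypothetical channel-attention gate (squeeze-and-excitation style):
    global-average-pool each channel, pass the pooled vector through a
    small bottleneck MLP, and rescale channels by the sigmoid weights."""
    # features: (C, H, W); w1: (C, C//r); w2: (C//r, C) -- illustrative shapes
    squeeze = features.mean(axis=(1, 2))             # per-channel descriptor, (C,)
    hidden = np.maximum(squeeze @ w1, 0.0)           # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # sigmoid gates, (C,)
    return features * gates[:, None, None]           # reweighted channels

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Deep triplet objective: pull same-class embeddings together and push
    different-class embeddings at least `margin` farther apart."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

In a pipeline like the one described, region-wise features (frame, face, lips, eyes, nose) would be gated by such an attention module before the triplet objective shapes the embedding space into real and fake clusters.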

Paper Structure

This paper contains 34 sections, 8 equations, 13 figures, and 12 tables.

Figures (13)

  • Figure 1: Key concepts of the proposed method: (a) conventional deepfake detection, which relies solely on facial artifact analysis; (b) the proposed method, which leverages hierarchical information across multiple levels, coarse (frame), medium (face), and fine-grained (lips, left eye, right eye, and nose), to improve deepfake detection performance.
  • Figure 2: Categories of deepfake detection methods: (a) conventional methods, which rely on handcrafted features, (b) deep learning-based methods that automatically learn discriminative representations from data, and (c) hybrid approaches that combine traditional cues with deep learning models to improve detection performance.
  • Figure 3: Examples of deepfake generation using different manipulation techniques: (a) face swapping, where the source face is replaced with a target face (FF++ [rossler2019faceforensics++]/FSW dataset), and (b) lip synchronization, where facial lip movements are manipulated using different audio tracks (PDD [sankaranarayanan2021presidential] dataset).
  • Figure 4: Performance analysis comparing feature representations: This study compares face-only and lips-only features with the proposed HFR approach. The results demonstrate that hierarchical feature representations provide superior effectiveness for deepfake detection across the FF++ [rossler2019faceforensics++]/FSW, FF++/NT, and PDD [sankaranarayanan2021presidential] datasets.
  • Figure 5: Architecture of the proposed method: The proposed framework is composed of three sequential phases. In Phase 1, an encoder network is employed to extract hierarchical feature representations from multiple facial regions, including the full frame, face, lips, left eye, right eye, and nose. In Phase 2, a deep triplet learning strategy combined with a CA-MLP mechanism is introduced to model and exploit interrelationships among regional features, yielding discriminative embeddings organized into two clusters. Finally, in Phase 3, the learned embeddings are passed to a classifier for final prediction.
  • ...and 8 more figures