Multi-Channel Cross Modal Detection of Synthetic Face Images

M. Ibsen, C. Rathgeb, S. Marcel, C. Busch

TL;DR

This work tackles the challenge of detecting synthetic face images produced by modern generative models under realistic post-processing. It introduces a dual-channel CNN that processes RGB imagery and per-channel frequency spectra, supervised by Cross Modal Focal Loss to balance channel contributions and focus on difficult samples. The approach, combining a DenseNet-based architecture with CMFL, demonstrates improved generalization to unseen generators and post-processing, outperforming single-channel and BCE-based baselines in cross-model evaluations. The results underscore the value of integrating spatial and spectral cues and novel loss functions for robust synthetic-media detection in practice.

Abstract

Synthetically generated face images have been shown to be indistinguishable from real images by humans and, as such, can lead to a lack of trust in digital content, since they can, for instance, be used to spread misinformation. Therefore, the need to develop algorithms for detecting entirely synthetic face images is apparent. Of interest are images generated by state-of-the-art deep learning-based models, as these exhibit a high level of visual realism. Recent works have demonstrated that detecting such synthetic face images under realistic circumstances remains difficult, as new and improved generative models are proposed at a rapid pace and arbitrary image post-processing can be applied. In this work, we propose a multi-channel architecture for detecting entirely synthetic face images which analyses information in both the frequency and visible spectra using Cross Modal Focal Loss. We compare the proposed architecture with several related architectures trained using Binary Cross Entropy and show in cross-model experiments that the proposed architecture supervised using Cross Modal Focal Loss, in general, achieves the most competitive performance.
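
The abstract names Cross Modal Focal Loss (CMFL) but does not state its formula. The sketch below follows the formulation commonly given for CMFL in earlier face anti-spoofing work, where the focal modulation of one channel depends on how confident the other channel is: w(p_t, q_t) = q_t * 2 p_t q_t / (p_t + q_t) and CMFL = -alpha * (1 - w)^gamma * log(p_t). Whether this paper uses exactly this form is an assumption, and the function names and the values of `alpha`, `gamma` and `lam` are illustrative, not taken from the paper.

```python
import math


def true_class_prob(p, y):
    """Probability assigned to the true class y (1 or 0) given sigmoid output p."""
    return p if y == 1 else 1.0 - p


def cmfl_weight(p_t, q_t, eps=1e-7):
    """Harmonic mean of p_t and q_t, scaled by the other channel's confidence q_t."""
    return q_t * (2.0 * p_t * q_t) / (p_t + q_t + eps)


def cmfl(p, q, y, alpha=1.0, gamma=3.0, eps=1e-7):
    """CMFL for the channel with output p, modulated by the other channel's output q."""
    p_t = true_class_prob(p, y)
    q_t = true_class_prob(q, y)
    w = cmfl_weight(p_t, q_t, eps)
    return -alpha * (1.0 - w) ** gamma * math.log(max(p_t, eps))


def bce(p, y, eps=1e-7):
    """Binary cross entropy for a single sample."""
    return -math.log(max(true_class_prob(p, y), eps))


def total_loss(p_joint, p_rgb, p_freq, y, lam=0.5, alpha=1.0, gamma=3.0):
    """Assumed combination: BCE on the joint head plus CMFL on both channel heads."""
    cross = cmfl(p_rgb, p_freq, y, alpha, gamma) + cmfl(p_freq, p_rgb, y, alpha, gamma)
    return (1.0 - lam) * bce(p_joint, y) + lam * cross
```

Under this formulation, a sample that one channel already classifies confidently contributes less to the other channel's loss, so each channel is pushed towards the samples that neither handles well.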

Paper Structure (15 sections, 6 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Proposed multi-channel architecture for detecting synthetic face images. An RGB face image and its corresponding frequency spectra are fed into separate network channels based on DenseNet. Global Average Pooling (GAP) is applied to the output of each channel to obtain embeddings which are concatenated into a joint representation. For each embedding, a fully connected layer (FC) is added together with the Sigmoid function, which results in three network heads. We propose to supervise the network using Cross Modal Focal Loss.
  • Figure 2: Examples of pristine and synthetic images from the different generative models used.
  • Figure 3: Analysis of the frequency spectra for FFHQ (pristine) and each of the generative models.
  • Figure 5: ROC curves for protocol I when training on StyleGAN2.
  • Figure 6: ROC curves for protocol II when training on StyleGAN2.
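
The head structure described in the caption of Figure 1 (GAP per channel, concatenation into a joint embedding, one FC plus Sigmoid per embedding) can be sketched in plain Python as follows. This is a toy illustration only: the DenseNet channels are replaced by precomputed feature maps, and the names `three_heads` and `params` as well as all weights are hypothetical, not taken from the paper.

```python
import math


def global_average_pool(feature_maps):
    """Average each 2D feature map to one value, giving one embedding entry per map."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0])) for fm in feature_maps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fc_sigmoid(embedding, weights, bias):
    """Single-output fully connected layer followed by the Sigmoid function."""
    return sigmoid(sum(w * e for w, e in zip(weights, embedding)) + bias)


def three_heads(rgb_maps, freq_maps, params):
    """Return the RGB, frequency and joint head outputs.

    rgb_maps / freq_maps stand in for the outputs of the two DenseNet-based
    channels; params maps each head name to a (weights, bias) pair.
    """
    e_rgb = global_average_pool(rgb_maps)
    e_freq = global_average_pool(freq_maps)
    e_joint = e_rgb + e_freq  # list concatenation forms the joint representation
    return {
        "rgb": fc_sigmoid(e_rgb, *params["rgb"]),
        "freq": fc_sigmoid(e_freq, *params["freq"]),
        "joint": fc_sigmoid(e_joint, *params["joint"]),
    }


# Toy usage: two 2x2 feature maps per channel and hand-picked weights.
rgb_maps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
freq_maps = [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 1.0], [1.0, 1.0]]]
params = {
    "rgb": ([1.0, 0.0], 0.0),
    "freq": ([0.0, 0.0], 0.0),
    "joint": ([0.0, 0.0, 0.0, 0.0], 0.0),
}
heads = three_heads(rgb_maps, freq_maps, params)
```

In training, the two channel heads would feed the cross-modal terms of the loss while the joint head provides the final decision score; that pairing is an assumption based on the three-head description in the caption.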