Multi-Channel Cross Modal Detection of Synthetic Face Images
M. Ibsen, C. Rathgeb, S. Marcel, C. Busch
TL;DR
This work tackles the challenge of detecting synthetic face images produced by modern generative models under realistic post-processing. It introduces a dual-channel CNN that processes RGB imagery and per-channel frequency spectra, supervised by Cross Modal Focal Loss (CMFL) to balance channel contributions and focus on difficult samples. The approach, which combines a DenseNet-based architecture with CMFL, generalizes better to unseen generators and post-processing than single-channel and BCE-based baselines in cross-model evaluations. The results underscore the value of integrating spatial and spectral cues with a cross-modal loss function for robust synthetic-media detection in practice.
Abstract
Synthetically generated face images have been shown to be indistinguishable from real images by humans and, as such, can lead to a lack of trust in digital content, since they can, for instance, be used to spread misinformation. Therefore, the need to develop algorithms for detecting entirely synthetic face images is apparent. Of interest are images generated by state-of-the-art deep learning-based models, as these exhibit a high level of visual realism. Recent works have demonstrated that detecting such synthetic face images under realistic circumstances remains difficult, as new and improved generative models are proposed at a rapid pace and arbitrary image post-processing can be applied. In this work, we propose a multi-channel architecture for detecting entirely synthetic face images which analyses information in both the frequency and visible spectra using Cross Modal Focal Loss. We compare the proposed architecture with several related architectures trained using Binary Cross Entropy and show in cross-model experiments that the proposed architecture supervised using Cross Modal Focal Loss, in general, achieves the most competitive performance.
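As a rough illustration of the two ingredients named above, the sketch below computes a per-channel log-magnitude frequency spectrum for an RGB image and evaluates a Cross Modal Focal Loss term for one branch given the other branch's confidence. It is a minimal NumPy sketch under stated assumptions, not the implementation used in this work: the function names, the log-magnitude scaling, and the default values of `alpha` and `gamma` are hypothetical, and the loss follows the commonly cited form CMFL(p, q) = -alpha * (1 - w(p, q))^gamma * log(p) with w(p, q) = q * 2pq / (p + q), where p and q are the probabilities the two branches assign to the true class.

```python
import numpy as np


def per_channel_log_spectrum(image):
    """Return the centred log-magnitude spectrum of each colour channel.

    image: float array of shape (H, W, 3).
    The log1p scaling is an assumption made for this sketch.
    """
    # 2D FFT over the spatial axes, applied to each channel independently
    spectrum = np.fft.fft2(image, axes=(0, 1))
    # Move the zero-frequency component to the centre of each channel
    spectrum = np.fft.fftshift(spectrum, axes=(0, 1))
    return np.log1p(np.abs(spectrum))


def cross_modal_focal_loss(p, q, alpha=1.0, gamma=3.0, eps=1e-7):
    """CMFL term for the branch with true-class probability p.

    q is the true-class probability from the other branch.
    alpha and gamma defaults are hypothetical, chosen for illustration.
    """
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    # Weight is large only when the other branch is confident and both agree
    w = q * (2.0 * p * q) / (p + q)
    # A large w shrinks the modulating factor and down-weights the sample
    return -alpha * (1.0 - w) ** gamma * np.log(p)
```

In this form, a sample that the other branch already classifies confidently contributes less to the current branch's loss, which is one way to read the "balance channel contributions and focus on difficult samples" behaviour described in the TL;DR.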
