Multi-spectral Class Center Network for Face Manipulation Detection and Localization

Changtao Miao, Qi Chu, Zhentao Tan, Zhenchao Jin, Tao Gong, Wanyi Zhuang, Yue Wu, Bin Liu, Honggang Hu, Nenghai Yu

TL;DR

This work addresses the need for precise, explainable localization of face manipulations by moving beyond image-level detection to pixel-level predictions that leverage frequency-domain forgery cues. The authors introduce MSCCNet, a two-branch architecture that combines Multi-level Features Aggregation (MFA) with a Multi-spectral Class Center (MSCC) module to learn semantic-agnostic, frequency-aware representations, further refined by a graph-based cross-spectral interaction. They implement a Discrete Cosine Transform-based frequency decomposition and a spectral-class center attention mechanism, enabling robust localization across diverse manipulations and unseen datasets. Extensive experiments on reconstructed P-FF++ and the Dolos dataset demonstrate superior localization performance and strong cross-dataset and cross-manipulation generalization, accompanied by meaningful ablations that validate the contributions of MFA, MSCC, DCT filters, and spectral fusion. The work offers a practical, generalizable approach to forensic localization with potential for broader application in pixel-level manipulation analysis.

Abstract

As deepfake content proliferates online, advancing face manipulation forensics has become crucial. To combat this emerging threat, previous methods have mainly focused on distinguishing authentic from manipulated face images. Although impressive, image-level classification lacks explainability and is limited to specific application scenarios, spurring recent research on pixel-level prediction for face manipulation forensics. However, existing forgery localization methods struggle to exploit frequency-based forgery traces in the localization network. In this paper, we observe that multi-frequency spectrum information is effective for identifying tampered regions. To this end, a novel Multi-Spectral Class Center Network (MSCCNet) is proposed for face manipulation detection and localization. Specifically, we design a Multi-Spectral Class Center (MSCC) module to learn more generalizable, multi-frequency features. Based on the features of different frequency bands, the MSCC module collects multi-spectral class centers and computes pixel-to-class relations. Applying multi-spectral class-level representations suppresses the semantic information of visual concepts, which is insensitive to the manipulated regions of forged images. Furthermore, we propose a Multi-level Features Aggregation (MFA) module to exploit more low-level forgery artifacts and structural textures. Meanwhile, we build a comprehensive localization benchmark on the pixel-level FF++ and Dolos datasets. Experimental results quantitatively and qualitatively demonstrate the effectiveness and superiority of the proposed MSCCNet. We expect this work to inspire more studies on pixel-level face manipulation localization. The code is available at https://github.com/miaoct/MSCCNet.
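
The two ideas named in the abstract, splitting features into frequency bands and relating each pixel to per-class centers, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`dct_matrix`, `split_bands`, `class_center_attention`), the band partition on normalized DCT indices, the use of a coarse per-class probability map to weight the centers, and the scaled softmax affinity are all assumptions made for illustration, and the graph-based cross-spectral interaction mentioned in the TL;DR is omitted.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def split_bands(feat, num_bands=4):
    """Split a (C, H, W) feature map into frequency bands that sum to `feat`.

    Assumption: bands are defined by equal-width intervals of the
    normalized DCT index u/H + v/W (a hypothetical partition).
    """
    _, h, w = feat.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeff = dh @ feat @ dw.T  # 2D DCT applied to every channel
    radius = np.arange(h)[:, None] / h + np.arange(w)[None, :] / w
    edges = np.linspace(0.0, radius.max() + 1e-9, num_bands + 1)
    bands = []
    for b in range(num_bands):
        mask = (radius >= edges[b]) & (radius < edges[b + 1])
        bands.append(dh.T @ (coeff * mask) @ dw)  # inverse DCT of one band
    return bands

def class_center_attention(band, prob):
    """Compute class centers for one band and re-express pixels through them.

    band: (C, H, W) features of one frequency band.
    prob: (K, H, W) coarse per-class probabilities (assumed given,
          e.g. K=2 for real/fake).
    Returns the refined (C, H, W) features and the (K, C) class centers.
    """
    c, h, w = band.shape
    x = band.reshape(c, -1).T                # (N, C) pixel features
    p = prob.reshape(prob.shape[0], -1)      # (K, N) pixel weights
    centers = (p @ x) / (p.sum(axis=1, keepdims=True) + 1e-8)
    logits = x @ centers.T / np.sqrt(c)      # (N, K) pixel-to-class relation
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over classes
    refined = attn @ centers                 # (N, C) class-level representation
    return refined.T.reshape(c, h, w), centers
```

Because every pixel is rewritten as a mixture of only K class centers per band, content-specific detail in the feature map is largely discarded in this sketch, which matches the abstract's stated motivation of suppressing semantic information.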

Paper Structure (34 sections, 16 equations, 5 figures, 10 tables)

This paper contains 34 sections, 16 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Visualizing the different frequency spectrum maps (FSMs) from the localization network across various manipulation methods. The Image column shows examples of (a) RePaint lugmayr2022RePaint, (b) LaMa rombach2022high, (c) LDM suvorov2021resolution, and (d) Pluralistic zheng2019pluralistic manipulations. The Mask column indicates the ground truth tampered regions. Columns FSM 1-4 depict the network's multi-spectral feature maps at different frequency bands of the corresponding forged images. Tampered areas present a more homogeneous color distribution, while authentic regions display a more heterogeneous appearance.
  • Figure 2: Detailed architecture of the proposed MSCCNet. The overall network structure is shown in (a), which consists of a backbone network, a classification branch, and a localization branch. (b) shows the scheme of the forgery-related low-level texture features aggregation. (c) illustrates the process of multi-spectral class centers and different frequency attention calculations. They are solely dedicated to enhancing the capabilities of the localization branch.
  • Figure 3: Pixel-level annotation procedure for the P-FF++ dataset. The symbol $\ast$ is a multiplication operation.
  • Figure 4: Visualization of mask predictions from baseline methods and our MSCCNet. The examples are randomly selected from the C40 test sets of P-FF++ and Dolos. Each row corresponds to a different face manipulation technique. Columns Image and Mask represent the input forged face and its corresponding pixel-level label, respectively.
  • Figure 5: Visualization of different frequency class center maps (FCCM). The Image column shows different types of manipulation methods, where (a), (b), (c), and (d) correspond to RePaint lugmayr2022RePaint, LaMa rombach2022high, LDM suvorov2021resolution, and Pluralistic zheng2019pluralistic, respectively. The Mask column indicates the tampered regions of the forged images. The FCCM 1-4 columns represent the class center feature maps of the four frequency bands, indicating that our method effectively distinguishes the tampered regions from the authentic regions.