Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization

Changtao Miao, Qi Chu, Tao Gong, Zhentao Tan, Zhenchao Jin, Wanyi Zhuang, Man Luo, Honggang Hu, Nenghai Yu

TL;DR

This work tackles the challenge of detecting and localizing multi-face forgeries by introducing MoNFAP, a unified framework that jointly predicts image-level authenticity and pixel-level tampered regions. It combines a Forgery-aware Unified Predictor (FUP), which uses token learning and Forgery-aware Transformers to link classification with localization, with a Mixture-of-Noises Module (MNM) that injects diverse noise cues via a four-expert MoNE architecture to strengthen forgery cues in RGB features. The approach leverages a multi-scale strategy to detect small manipulated regions and employs an MoE-inspired gating mechanism with an Importance Loss to balance expert usage. Extensive benchmarks on curated multi-face datasets (OFV2, FFIW-derived variants, and Manual-Fake) demonstrate state-of-the-art localization performance, strong cross-dataset generalization, and robustness to real-world perturbations, highlighting the practical impact for reliable multi-face forgery analysis.
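The expert-balancing idea mentioned above can be illustrated with a small sketch. It assumes the Importance Loss follows the formulation commonly used in mixture-of-experts work, namely the squared coefficient of variation of each expert's total gate weight over a batch; the summary does not state the exact form used in MoNFAP, so the function name, the `weight` parameter, and the use of population variance are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def importance_loss(gate_logits, weight=1.0):
    """Squared coefficient of variation of per-expert importance.

    gate_logits: one list of expert logits per sample in the batch.
    Assumed formulation, not taken from the paper.
    """
    gates = [softmax(row) for row in gate_logits]
    num_experts = len(gates[0])
    # Importance of an expert = sum of its gate weights over the batch.
    importance = [sum(row[e] for row in gates) for e in range(num_experts)]
    mean = sum(importance) / num_experts
    variance = sum((v - mean) ** 2 for v in importance) / num_experts
    return weight * variance / (mean ** 2 + 1e-10)

# Four experts, eight samples: uniform gating vs. gating that favors one expert.
balanced = [[0.0, 0.0, 0.0, 0.0] for _ in range(8)]
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(8)]
loss_balanced = importance_loss(balanced)
loss_collapsed = importance_loss(collapsed)
```

Under this assumed form, the loss is zero when every expert receives the same total gate weight and grows as the gate concentrates on a single expert, which is what pushes training toward balanced expert usage.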

Abstract

With the advancement of face manipulation technology, forged images in multi-face scenarios are becoming an increasingly complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either derive detection results indirectly from localization masks, which limits detection performance, or employ a naive two-branch structure to obtain detection and localization results simultaneously, which does little to improve localization because the two tasks interact only weakly. This paper proposes a new framework, MoNFAP, tailored specifically to multi-face manipulation detection and localization. MoNFAP introduces two novel modules: the Forgery-aware Unified Predictor (FUP) and the Mixture-of-Noises Module (MNM). The FUP integrates the detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, so that classification information can be used to enhance localization. In addition, motivated by the crucial role of noise information in forgery detection, the MNM uses multiple noise extractors, organized following the mixture-of-experts concept, to enhance the general RGB features and further boost the framework's performance. Finally, we establish a comprehensive benchmark for multi-face detection and localization, on which the proposed MoNFAP achieves strong performance. The code will be made available.
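To make the first traditional paradigm concrete, the following minimal sketch derives an image-level decision from a pixel-level mask alone. The pooling rule (taking the maximum per-pixel forgery probability) and the names `detect_from_mask` and `threshold` are assumptions for illustration; the abstract does not say which rule existing methods use.

```python
def detect_from_mask(mask_probs, threshold=0.5):
    """Image-level decision derived only from a pixel-level mask.

    mask_probs: 2D list of per-pixel forgery probabilities in [0, 1].
    """
    # Assumption: the image score is the largest per-pixel probability.
    score = max(p for row in mask_probs for p in row)
    return score, score >= threshold

# A 3x3 toy mask with a single high-probability pixel.
mask = [
    [0.02, 0.03, 0.01],
    [0.04, 0.91, 0.05],
    [0.01, 0.02, 0.03],
]
score, is_fake = detect_from_mask(mask)
```

In this toy setup the image-level result depends entirely on the localization output, so any error in the mask carries over directly to detection; this is one plausible reading of why the abstract describes detection performance in this paradigm as limited.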

Paper Structure (46 sections, 15 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Three different paradigms: (a) detection-by-localization, which indirectly obtains the image-level detection result from the pixel-level mask; (b) a two-branch architecture, which employs separate classification and localization branches on a shared backbone; and (c) our unified framework, which integrates detection and localization into a single predictor.
  • Figure 2: (a) The detection-by-localization method shows limited image-level detection performance, whereas the two-branch method and ours unlock the model's detection capability. (b) The two-branch approach does not effectively improve localization over its detection-by-localization counterpart, while our method uses classification information to enhance localization. More experimental results and analyses are given in Tab. \ref{tab:ab_modes} of Sec. \ref{sec_Fab_1}.
  • Figure 3: Detailed architecture of the proposed MoNFAP. First, the MNM, which consists of four Mixture of Noise Extractors (MoNE) modules, processes the multi-scale features obtained from the backbone network and outputs noise patterns that enhance the general features within the FUP. Then, the FUP uses the output tokens and the Forgery-aware Transformer (FAT) to jointly predict the classification and localization results. For clarity, we omit the two outputs of the FAT and the generation of the auxiliary layer for the attention mask.
  • Figure 4: Detailed architecture of the Forgery-aware Transformer (FAT) and Mixture of Noise Extractors (MoNE) modules. (a) The blue and orange lines represent the computation flow of the output tokens and the image features, respectively. $\times 2$ indicates that the computation is repeated twice. (b) In the MoNE module, $\oplus$ denotes element-wise addition and $\otimes$ denotes element-wise multiplication. The dashed line indicates the adaptive weights computed by the gating network.
  • Figure 5: We present the collected datasets, namely OFV2, FFIW, and Manual-Fake. Row 'Real Image' represents genuine samples, row 'Fake Image' represents forged samples (one or more faces tampered with), and row 'Ground Truth' represents annotations of the tampered regions.
  • ...and 2 more figures
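
The gating described in the Figure 4(b) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the gating network's adaptive weights are softmax-normalized, that each weight scales one noise expert's output (the $\otimes$ step), and that the weighted outputs are added to the RGB feature (the $\oplus$ step). The exact wiring is not recoverable from the caption, and `gated_noise_fusion` and its arguments are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_noise_fusion(rgb_feature, expert_outputs, gate_logits):
    """Add a gate-weighted mix of noise-expert outputs to an RGB feature.

    rgb_feature:    flat list of floats (one feature vector).
    expert_outputs: one list per expert, same length as rgb_feature.
    gate_logits:    one logit per expert, as produced by a gating network.
    """
    weights = softmax(gate_logits)          # adaptive weights (dashed line)
    fused = list(rgb_feature)
    for w, out in zip(weights, expert_outputs):
        for i, v in enumerate(out):
            fused[i] += w * v               # multiply by weight, then add
    return fused, weights

# Two toy experts with equal gate logits, so each gets weight 0.5.
rgb = [1.0, 2.0]
experts = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = gated_noise_fusion(rgb, experts, [0.0, 0.0])
```

Because the weights come from the gate rather than being fixed, the contribution of each noise extractor can vary with the input, which is the property the mixture-of-experts design relies on.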