MMFusion: Combining Image Forensic Filters for Visual Manipulation Detection and Localization

Kostas Triaridis, Konstantinos Tsigos, Vasileios Mezaris

TL;DR

This paper assesses two methods for combining forensic filters: late fusion, which produces independent features from each forensic filter and then fuses them, and early fusion, which mixes the outputs of the different filters early and produces combined features.

Abstract

Recent image manipulation localization and detection techniques typically leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM or Bayar convolution. In this paper, we showcase that different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. Thus, we explore ways of combining the outputs of such filters to leverage the complementary nature of the produced artifacts for performing image manipulation localization and detection (IMLD). We assess two distinct combination methods: one that produces independent features from each forensic filter and then fuses them (this is referred to as late fusion) and one that performs early mixing of different modal outputs and produces combined features (this is referred to as early fusion). We use the latter as a feature encoding mechanism, accompanied by a new decoding mechanism that encompasses feature re-weighting, for formulating the proposed MMFusion architecture. We demonstrate that MMFusion achieves competitive performance for both image manipulation localization and detection, outperforming state-of-the-art models across several image and video datasets. We also investigate further the contribution of each forensic filter within MMFusion for addressing different types of manipulations, building on recent AI explainability measures.
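
The difference between the two combination strategies can be illustrated with a minimal NumPy sketch. It is not the MMFusion encoder: `toy_encoder`, the mean used to merge late-fusion features, the 1x1 channel mixing used for early fusion, and all shapes are assumptions made for illustration only, and random arrays stand in for real filter outputs.

```python
import numpy as np


def toy_encoder(x, weight):
    """Hypothetical stand-in encoder: 1x1 channel mixing followed by ReLU.

    x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)
    """
    return np.maximum(np.einsum("oc,chw->ohw", weight, x), 0.0)


def late_fusion(filter_maps, enc_weights):
    """Encode each filter output independently, then fuse the features.

    The fusion here is a plain mean, chosen only to keep the sketch short.
    """
    feats = [toy_encoder(x, w) for x, w in zip(filter_maps, enc_weights)]
    return np.mean(feats, axis=0)


def early_fusion(filter_maps, mix_weight, enc_weight):
    """Mix the stacked filter outputs first, then encode the mixture once."""
    stacked = np.concatenate(filter_maps, axis=0)          # (K*C, H, W)
    mixed = np.einsum("oc,chw->ohw", mix_weight, stacked)  # (C, H, W)
    return toy_encoder(mixed, enc_weight)


rng = np.random.default_rng(0)
K, C, H, W, D = 3, 3, 8, 8, 16  # filters, channels, height, width, feature dim

# Random arrays standing in for the outputs of K forensic filters.
filter_maps = [rng.standard_normal((C, H, W)) for _ in range(K)]

late = late_fusion(filter_maps, [rng.standard_normal((D, C)) for _ in range(K)])
early = early_fusion(
    filter_maps,
    rng.standard_normal((C, K * C)),  # mixing weights across all filter channels
    rng.standard_normal((D, C)),      # weights of the single shared encoder
)
print(late.shape, early.shape)
```

Both paths return a feature map of the same shape. The structural difference is that late fusion runs one encoder per filter, whereas early fusion runs a single encoder on the mixed input.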

Paper Structure (25 sections, 1 equation, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Overview of the MMFusion encoder-decoder architecture for image manipulation localization and detection with multiple forensic filters. The RGB image and the output of each filter are fed into a multi-scale encoder, whose output is passed to both the anomaly decoder, which produces a localization map, and the confidence decoder, which produces a confidence map. The two maps are then combined through a pooling module and passed to the forgery detector, which produces the manipulation detection score.
  • Figure 2: Architecture of the dual-branch encoder of CMX. The encoder consists of 4 stages of Multi-Head Self-Attention (MHSA) blocks that produce feature maps $f_{mod}^i$ for modality $mod\in\{image, filter\}$ and stage $i\in\{1,2,3,4\}$. These are then rectified and fused by the FRM and FFM modules to produce the outputs $F^i$ at each scale $i$. The feature map set $F=\{F^i,\ i=1,\dots,4\}$ is the final output returned by the encoder.
  • Figure 3: Proposed architecture of the encoder for fusion of multiple forensic filters by late fusion with weight sharing. The filters' outputs and the RGB image are fed into separate Multi-Head Self-Attention (MHSA) blocks of the dual-branch CMX encoder, with the outputs rectified and combined by the FRM and FFM modules to produce the feature maps. These are propagated through different stages to create feature maps of varying scales. The weights of the MHSA blocks of all RGB branches are shared to increase regularization.
  • Figure 4: Proposed architecture of the encoder for fusion of multiple forensic filters by early convolutions. The structure of the encoder is illustrated on the left. The filters' outputs are first fused by early convolutional blocks in the Early Fusion Module to produce the mixed features $f_a$. These features and the RGB image are then fed into separate Multi-Head Self-Attention (MHSA) blocks of a dual-branch CMX encoder, with the outputs rectified and combined by the FRM and FFM modules to produce the feature maps. These are propagated through different stages to create feature maps of varying scales. The structure of the convolutional block is presented on the right.
  • Figure 5: Proposed architecture of the Feature Re-weighting Decoder (FRD). The feature maps $F$ returned by the encoder are processed through convolutional layers, batch normalization, and activation functions to produce channel- and spatial-wise weighted feature maps that enhance subtle variations in the input maps. These are then passed to the MLP-based decoder to generate the localization/confidence map.
  • ...and 6 more figures
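
The pooling step described in the Figure 1 caption, where the localization map and the confidence map are combined before the forgery detector, might look roughly like the sketch below. The caption does not specify the pooling operations, so the three statistics used here (confidence-weighted average, maximum, and plain mean) and the function name `confidence_weighted_pool` are assumptions for illustration, not the paper's module.

```python
import numpy as np


def confidence_weighted_pool(loc_map, conf_map, eps=1e-8):
    """Summarize a localization map as a small vector, using a confidence map.

    Hypothetical pooling: returns [confidence-weighted mean, max, mean].
    A learned forgery detector would map such a vector to a detection score.
    """
    loc = np.asarray(loc_map, dtype=float)
    conf = np.asarray(conf_map, dtype=float)
    if loc.shape != conf.shape:
        raise ValueError("localization and confidence maps must share a shape")
    # Pixels with low confidence contribute little to the weighted average.
    weighted_mean = (loc * conf).sum() / (conf.sum() + eps)
    return np.array([weighted_mean, loc.max(), loc.mean()])


# Toy 2x2 maps: the confidence map trusts only the two diagonal pixels.
loc_example = [[0.9, 0.1], [0.2, 0.8]]
conf_example = [[1.0, 0.0], [0.0, 1.0]]
pooled = confidence_weighted_pool(loc_example, conf_example)
print(pooled)
```

In this toy case the weighted average is driven only by the two trusted pixels, so it sits well above the plain mean of the map. That is the intended effect of weighting by confidence.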