Fusion Transformer with Object Mask Guidance for Image Forgery Analysis

Dimitrios Karageorgiou, Giorgos Kordopatis-Zilos, Symeon Papadopoulos

TL;DR

OMG-Fuser introduces a Transformer-based fusion framework that jointly processes RGB data and an arbitrary number of forensic signals through dedicated streams, guided by object-level information via an Object Guided Attention mechanism. A Token Fusion Transformer then aggregates per-patch representations across streams, and a Long-range Dependencies Transformer captures global relationships to output pixel-level forgery masks and image-level detection scores. The approach supports both feature-level and score-level fusion, achieves state-of-the-art performance across seven datasets, and demonstrates robustness to common perturbations and neural filters while enabling expansion with new signals without retraining from scratch. The method leverages instance segmentation (via SAM) and a pretrained RGB backbone (DINOv2), delivering strong practical utility for robust image forgery analysis and localization in the wild.

Abstract

In this work, we introduce OMG-Fuser, a fusion transformer-based network designed to extract information from various forensic signals to enable robust image forgery detection and localization. Our approach can operate with an arbitrary number of forensic signals and leverages object information for their analysis -- unlike previous methods that rely on fusion schemes with few signals and often disregard image semantics. To this end, we design a forensic signal stream composed of a transformer guided by an object attention mechanism, associating patches that depict the same objects. In that way, we incorporate object-level information from the image. Each forensic signal is processed by a different stream that adapts to its peculiarities. A token fusion transformer efficiently aggregates the outputs of an arbitrary number of network streams and generates a fused representation for each image patch. We assess two fusion variants on top of the proposed approach: (i) score-level fusion that fuses the outputs of multiple image forensics algorithms and (ii) feature-level fusion that fuses low-level forensic traces directly. Both variants exceed state-of-the-art performance on seven datasets for image forgery detection and localization, with a relative average improvement of 12.1% and 20.4% in terms of F1. Our model is robust against traditional and novel forgery attacks and can be expanded with new signals without training from scratch. Our code is publicly available at: https://github.com/mever-team/omgfuser
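
The per-patch aggregation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (that is in the linked repository); it only shows one way a single attention step can fuse tokens from any number of streams into one token per patch. The learnable fusion token, the single-head layout, and the names `fuse_tokens`, `fusion_token`, `w_q`, `w_k`, and `w_v` are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_tokens(stream_tokens, fusion_token, w_q, w_k, w_v):
    """Fuse S stream tokens per patch into one token (hypothetical sketch).

    stream_tokens: (S, P, D) array, one D-dim token per stream and patch.
    fusion_token:  (D,) array, assumed to be a learnable query token.
    w_q, w_k, w_v: (D, D) projection matrices.
    Returns a (P, D) array with one fused token per patch.
    """
    s, p, d = stream_tokens.shape
    # Regroup so that every patch owns a short sequence of S tokens.
    per_patch = stream_tokens.transpose(1, 0, 2)            # (P, S, D)
    # Prepend the same fusion token to every per-patch sequence.
    fusion = np.broadcast_to(fusion_token, (p, 1, d))       # (P, 1, D)
    seq = np.concatenate([fusion, per_patch], axis=1)       # (P, S+1, D)
    # Only the fusion token issues a query; all tokens act as keys/values.
    q = seq[:, :1] @ w_q                                    # (P, 1, D)
    k = seq @ w_k                                           # (P, S+1, D)
    v = seq @ w_v                                           # (P, S+1, D)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))   # (P, 1, S+1)
    return (attn @ v)[:, 0]                                 # (P, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, p = 8, 16
    weights = [rng.normal(size=(d, d)) for _ in range(3)]
    token = rng.normal(size=d)
    # The same weights handle 2 streams or 5 streams without any change.
    for s in (2, 5):
        fused = fuse_tokens(rng.normal(size=(s, p, d)), token, *weights)
        print(s, fused.shape)
```

Because attention runs over the stream axis separately for each patch, the sequence length is the number of streams plus one, so adding a signal changes only the length of these short sequences and not the shape of any weight matrix.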

Paper Structure (25 sections, 6 equations, 9 figures, 8 tables)

This paper contains 25 sections, 6 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: OMG-Fuser combines an arbitrary number of heterogeneous forensic signals for robust image forgery analysis guided by the image semantics.
  • Figure 2: Overview of OMG-Fuser. Forensic signals are fused into a robust forgery localization mask and detection score. To achieve that, it combines information from the RGB image and its instance segmentation maps. Each forensic signal and the RGB image are first processed by separate network streams through independent Object-Guided Transformers. Then, the proposed Token Fusion Module fuses the different streams, leading to features with a progressively increasing level of information granularity, from patch-level (in the early stages) to object-level (in intermediate stages) and to image-level (in the final stages). A localization and a detection head process the extracted forensic tokens to generate the final outputs.
  • Figure 3: Object-Guided Attention Mask: Restricts the transformer's attention to pairs of patches that depict the same object. The four attention regions defined by the mask for an example image are shown on the right. The background is treated as an additional object. For illustration purposes, the number of patches on each axis has been limited to eight.
  • Figure 4: Robustness evaluation on common perturbations. The pixel-level F1 is reported. Straight lines denote the feature-level approaches, and dashed lines the score-level approaches. The top approaches of each category are shown for readability.
  • Figure 5: Robustness against neural filters for removing JPEG artifacts. The pixel-level F1 is reported.
  • ...and 4 more figures
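
The object-guided attention mask described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' code. It assumes binary instance maps (for example, as produced by a segmenter such as SAM), assigns a patch to an object if any pixel of the patch belongs to it, and treats uncovered pixels as a background object. The function names `object_guided_attention_mask` and `masked_attention` and the patch-assignment rule are hypothetical choices.

```python
import numpy as np


def object_guided_attention_mask(instance_masks, patch_size):
    """Build a (P, P) boolean mask; True means attention is allowed.

    instance_masks: (K, H, W) boolean array, one binary map per object.
    patch_size:     side length of the square, non-overlapping patches.
    """
    k, h, w = instance_masks.shape
    gh, gw = h // patch_size, w // patch_size
    # Background is treated as one more object: pixels in no instance.
    background = ~instance_masks.any(axis=0, keepdims=True)
    masks = np.concatenate([instance_masks, background], axis=0)
    masks = masks[:, : gh * patch_size, : gw * patch_size]
    # Assumption: a patch depicts an object if any of its pixels does.
    patches = masks.reshape(k + 1, gh, patch_size, gw, patch_size)
    membership = patches.any(axis=(2, 4)).reshape(k + 1, gh * gw)
    # Two patches may attend to each other if they share an object.
    m = membership.astype(np.int64)
    return (m.T @ m) > 0


def masked_attention(q, k, v, allowed):
    """Single-head scaled dot-product attention restricted by `allowed`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf, so their softmax weight is exactly zero.
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    # A 4x4 image with one object in the top-left 2x2 block, patch size 2.
    inst = np.zeros((1, 4, 4), dtype=bool)
    inst[0, :2, :2] = True
    allowed = object_guided_attention_mask(inst, patch_size=2)
    print(allowed.astype(int))
    tokens = np.arange(8, dtype=float).reshape(4, 2)
    print(masked_attention(tokens, tokens, tokens, allowed))
```

Every pixel is covered either by an instance or by the background, so each patch belongs to at least one object and can always attend to itself. This keeps every row of the softmax well defined.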