SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies

Gagandeep Singh, Samudi Amarsinghe, Urawee Thani, Ki Fung Wong, Priyanka Singh, Xue Li

TL;DR

The paper tackles the blind spot of global FG-BG mismatches in multimodal disinformation detection by augmenting HAMMER with Segmentation-Guided Scoring (SGS), an inference-only pipeline that uses FG/BG segmentation to generate captions and compare semantic coherence in text space. SGS operates without retraining, producing region-aware scores that are fused with HAMMER to improve detection, grounding, and explanations. Experiments on an FG-BG inconsistent split show SGS as a strong standalone probe (high F1) and reveal complementary signals in contrastive and vision-only baselines, supporting the practical value of region-level reasoning. The work demonstrates that integrating lightweight, region-aware cues can significantly bolster robustness to global manipulations in multimodal disinformation, offering a scalable and reusable extension for existing detectors.

Abstract

We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER's original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs

Paper Structure

This paper contains 30 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustrative failure: subject and caption align locally, but the scene is globally implausible. Our checker targets the FG-BG relationship directly.
  • Figure 2: Segmentation-Guided Scoring (SGS) pipeline. Any segmenter can supply the foreground and background crops. Each crop is captioned independently (BLIP), embedded into a semantic space (MiniLM), and compared via cosine similarity. A low similarity score signals FG-BG inconsistency.
  • Figure 3: Integration of SGS with HAMMER. (a) If SGS deems the subject–scene pair consistent, the image and caption are passed to HAMMER as usual. (b) If SGS flags a mismatch, the image and caption are not routed into HAMMER, which saves computation.
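
The scoring and routing steps described in Figures 2 and 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' released code: the function names (sgs_score, route), the threshold value 0.3, and the bag-of-words toy_embed stand-in are all hypothetical. In the pipeline described above, the captions would come from BLIP and the embeddings from MiniLM; here both are replaced by injectable callables so the example runs without any model downloads.

```python
import math
from typing import Callable, List, Optional, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sgs_score(fg_caption: str, bg_caption: str, embed: Embedder) -> float:
    """Compare the foreground and background captions in embedding space."""
    return cosine_similarity(embed(fg_caption), embed(bg_caption))


def route(
    fg_caption: str,
    bg_caption: str,
    embed: Embedder,
    run_detector: Callable[[], object],
    threshold: float = 0.3,  # assumed value for illustration, not from the paper
) -> dict:
    """Gate the downstream detector on the SGS score (Figure 3 logic).

    A score below the threshold flags an FG-BG mismatch and skips the
    detector; otherwise the detector (e.g. HAMMER) is called.
    """
    score = sgs_score(fg_caption, bg_caption, embed)
    if score < threshold:
        return {"sgs_score": score, "flagged": True, "detector_output": None}
    return {"sgs_score": score, "flagged": False, "detector_output": run_detector()}


# Hypothetical stand-in for a sentence embedder: word counts over a tiny vocabulary.
VOCAB = ("man", "suit", "office", "desk", "beach", "waves")


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


if __name__ == "__main__":
    consistent = route("man suit office", "office desk", toy_embed, lambda: "detector ran")
    mismatch = route("man suit", "beach waves", toy_embed, lambda: "detector ran")
    print(consistent)
    print(mismatch)
```

With the toy embedder, captions that share scene words receive a nonzero similarity and are passed to the detector, while captions with no overlap score zero and are flagged without calling it. A real sentence embedder would give graded similarities, so the threshold would have to be tuned on held-out data.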