SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies

Gagandeep Singh; Samudi Amarsinghe; Urawee Thani; Ki Fung Wong; Priyanka Singh; Xue Li

TL;DR

The paper addresses a blind spot in multimodal disinformation detection: global foreground-background (FG-BG) mismatches. It augments HAMMER with Segmentation-Guided Scoring (SGS), an inference-only pipeline that uses FG/BG segmentation to generate region captions and compare their semantic coherence in text space. SGS requires no retraining and produces region-aware scores that are fused with HAMMER's output to improve detection, grounding, and explanations. Experiments on an FG-BG inconsistent split show that SGS is a strong standalone probe (high F1) and reveal complementary signals in contrastive and vision-only baselines, supporting the practical value of region-level reasoning. The work shows that lightweight, region-aware cues can significantly improve robustness to global manipulations in multimodal disinformation, offering a scalable and reusable extension for existing detectors.

Abstract

We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER's original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs
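The abstract describes the scoring and fusion steps only in prose, so the following is a minimal sketch of how a region-aware coherence score might be computed and fused with a detector's output. It is an illustration under stated assumptions, not the released implementation: the function names (`cosine`, `sgs_score`, `fuse`), the equal weighting of the two similarity terms, the linear mapping to [0, 1], and the fusion weight `alpha` are all hypothetical. The embeddings are assumed to come from a joint vision-language encoder applied to the masked foreground, the masked background, and the accompanying text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sgs_score(fg_emb, bg_emb, text_emb):
    """Hypothetical inconsistency score in [0, 1]; higher means less coherent.

    Assumption: coherence is the mean of FG-BG and BG-text cosine similarity.
    The paper's actual scoring rule may differ.
    """
    coherence = 0.5 * (cosine(fg_emb, bg_emb) + cosine(bg_emb, text_emb))
    # Map coherence from [-1, 1] to an inconsistency score in [0, 1].
    return (1.0 - coherence) / 2.0


def fuse(detector_prob, sgs, alpha=0.5):
    """Assumed late fusion: convex combination of the detector's manipulation
    probability and the SGS inconsistency score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * detector_prob + alpha * sgs
```

In this sketch, a subject whose foreground embedding is orthogonal to its background embedding raises the fused score even when the base detector's probability is low, which is the kind of complementary signal the abstract attributes to SGS. Because the fusion is a simple post-hoc combination, it needs no retraining of the base model.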
