DGM4+: Dataset Extension for Global Scene Inconsistency

Gagandeep Singh, Samudi Amarsinghe, Priyanka Singh, Xue Li

TL;DR

This work extends the DGM4 dataset to cover global scene inconsistencies by introducing foreground-background (FG-BG) mismatches and their hybrids with text edits, using 5,000 synthetic samples generated with gpt-image-1 and paired with news-style captions. It defines three new manipulation classes (FG-BG, FG-BG+TA, and FG-BG+TS), applies quality controls (face count, OCR scrubbing, deduplication), and standardizes images to a 400x256 crop. Experiments with baselines spanning contrastive encoders, vision-only features, multimodal LLMs, and HAMMER show that global plausibility signals exist and can be leveraged, but that current detectors (notably HAMMER) lack explicit FG-BG supervision. The results argue for models that combine vision-only cues, contrastive alignment gaps, and structured reasoning to detect both local and global multimodal manipulations. The authors publicly release the DGM4+ dataset and generation scripts to support future work.
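The summary mentions a standardized 400x256 crop without saying how it is computed. The sketch below shows one plausible approach, a centered crop to the 400:256 aspect ratio that would then be resized to 400x256. The function name `center_crop_box` and the choice of a centered crop are assumptions made for illustration, not the authors' documented procedure.

```python
def center_crop_box(width, height, target_w=400, target_h=256):
    """Return (left, top, right, bottom) of the largest centered box
    whose aspect ratio matches target_w:target_h.

    Assumption: the crop is centered. The source only states that
    images are standardized to a 400x256 crop.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    # Compare aspect ratios with integer cross-multiplication.
    if width * target_h > height * target_w:
        # Image is wider than the target: keep full height, trim the sides.
        new_w = round(height * target_w / target_h)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)

    # Image is as wide as or narrower than the target: keep full width,
    # trim the top and bottom.
    new_h = round(width * target_h / target_w)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)
```

The returned tuple follows the (left, upper, right, lower) convention, so it could be passed to an image library's crop call before resizing the result to 400x256.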

Abstract

Rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI's gpt-image-1 and carefully designed prompts, we generate human-centric news-style images where authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions (literal, text attribute, and text split), yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality control pipelines enforce one-to-three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets, creating a benchmark, DGM4+, that tests detectors on both local and global reasoning. This resource is intended to strengthen evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies. We release our DGM4+ dataset and generation script at https://github.com/Gaganx0/DGM4plus
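The four quality checks named in the abstract can be sketched as a simple filter. This is a minimal illustration, not the released pipeline: it assumes face counts, 64-bit perceptual hashes, and OCR output have already been computed by external tools, it treats "scrubbing" as rejecting any image in which OCR finds text, and the thresholds (`max_hamming=6`, 6 to 20 words per headline) are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    face_count: int  # assumed precomputed by any face detector
    phash: int       # assumed precomputed 64-bit perceptual hash
    ocr_text: str    # text found in the image by OCR, "" if none
    caption: str


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def passes_qc(sample, kept_hashes, max_hamming=6, min_words=6, max_words=20):
    """Apply the four checks; all thresholds are hypothetical."""
    # One to three visible faces.
    if not 1 <= sample.face_count <= 3:
        return False
    # OCR check (assumption: reject images that contain rendered text).
    if sample.ocr_text.strip():
        return False
    # Headline length measured in whitespace-separated words.
    if not min_words <= len(sample.caption.split()) <= max_words:
        return False
    # Perceptual-hash deduplication against samples already kept.
    if any(hamming(sample.phash, h) <= max_hamming for h in kept_hashes):
        return False
    return True


def filter_samples(samples, **thresholds):
    """Keep samples in order, dropping failures and near-duplicates."""
    kept, kept_hashes = [], []
    for sample in samples:
        if passes_qc(sample, kept_hashes, **thresholds):
            kept.append(sample)
            kept_hashes.append(sample.phash)
    return kept
```

Because deduplication compares each candidate only against samples already kept, the first image in a cluster of near-duplicates survives and later ones are dropped. A different ordering policy would need an explicit ranking step.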

Paper Structure

This paper contains 33 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Examples of each class in the DGM4+ dataset. The HAMMER model fails on the three newly introduced classes (j)-(l).
  • Figure 2: Illustration of the HAMMER model. The [CLS] tokens of the image and text embeddings are projected into a lower-dimensional space and aligned with manipulation-aware contrastive learning. The image and text embeddings are cross-attended, and the resulting patch tokens pass through Local Patch Attentional Aggregation (LPAA) to produce an [AGG] token, which is input to the BBox Detector for manipulation bounding-box detection. The image and text embeddings are also input to the Multi-Modal Aggregator to obtain an aggregated multi-modal embedding. The [CLS] token of this multi-modal embedding is fed into the Binary Classifier and the Multi-Label Classifier to obtain binary and fine-grained manipulation labels. The remaining aggregated tokens are input to the Token Detectors, which predict a label for each token.
  • Figure 3: Distribution of the extended dataset across pristine and manipulation categories. The additional 5,000 samples cover the Foreground-Background class and its hybrids. (a) Dataset class types; (b) Original DGM4 dataset distribution; (c) Extended DGM4+ distribution
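
The hybrid classes in Figure 3 and the Multi-Label Classifier in Figure 2 suggest that class names such as FG-BG+TA can be read as combinations of atomic manipulation types. The sketch below shows one way to turn a class name into a multi-hot target and a binary label. The atomic label list, its ordering, and the abbreviations FS and FA (assumed here to stand for the face swap and face attribute edits of the original DGM4) are illustrative assumptions, not a format confirmed by the source.

```python
# Hypothetical ordering of atomic manipulation types. FS and FA are assumed
# abbreviations for the original DGM4 face manipulations; TS, TA, and FG-BG
# appear in the class names used by the extension.
ATOMIC_LABELS = ("FS", "FA", "TS", "TA", "FG-BG")


def encode(class_name):
    """Map a class name such as 'FG-BG+TA' to a multi-hot list."""
    if class_name == "pristine":
        return [0] * len(ATOMIC_LABELS)
    # Hybrid classes join atomic types with '+'. The hyphen in 'FG-BG'
    # is part of the label and is not a separator.
    parts = class_name.split("+")
    unknown = [p for p in parts if p not in ATOMIC_LABELS]
    if unknown:
        raise ValueError(f"unknown manipulation type(s): {unknown}")
    return [1 if name in parts else 0 for name in ATOMIC_LABELS]


def is_manipulated(class_name):
    """Binary label: 1 if any atomic manipulation is present, else 0."""
    return int(any(encode(class_name)))
```

Under this encoding the three new classes differ only in their text bits, so a detector with no FG-BG output would have no target on which to separate a pure FG-BG sample from a pristine one.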