DGM4+: Dataset Extension for Global Scene Inconsistency
Gagandeep Singh, Samudi Amarsinghe, Priyanka Singh, Xue Li
TL;DR
This work extends the DGM4 dataset to cover global scene inconsistencies by adding 5,000 synthetic samples, generated with gpt-image-1 and paired with news-style captions, that contain foreground-background (FG-BG) mismatches and their hybrids with text edits. It defines three new manipulation classes (FG-BG, FG-BG+TA, and FG-BG+TS), applies quality controls (face count, OCR scrubbing, deduplication), and standardizes images to a 400x256 crop. Experiments with baselines spanning contrastive encoders, vision-only features, multimodal LLMs, and HAMMER show that global plausibility signals exist and can be leveraged, but that current detectors (notably HAMMER) lack explicit FG-BG supervision. The results argue for models that integrate vision-only cues, contrastive alignment gaps, and structured reasoning to detect both local and global multimodal manipulations. The DGM4+ dataset and generation scripts are publicly released to support future work.
Abstract
Rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI's gpt-image-1 and carefully designed prompts, we generate human-centric, news-style images in which authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions (literal, text attribute, and text split), yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality control pipelines enforce one to three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets and creates a benchmark, DGM4+, that tests detectors on both local and global reasoning. This resource is intended to strengthen the evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies. We release our DGM4+ dataset and generation script at https://github.com/Gaganx0/DGM4plus
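As an illustration of the perceptual hash deduplication step mentioned among the quality controls, the following is a minimal sketch and not the released pipeline. It assumes each image has already been downscaled to a small grayscale grid (for example 8x8), and the names `average_hash`, `hamming`, `deduplicate`, and the threshold `max_distance` are hypothetical choices made for this example. The abstract does not say which hash the authors use, so a simple average hash stands in here.

```python
from typing import List, Sequence

# Assumption: an image is a small grid of 0-255 grayscale values,
# already downscaled (e.g. to 8x8) before hashing.
Gray = Sequence[Sequence[int]]


def average_hash(pixels: Gray) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def deduplicate(images: List[Gray], max_distance: int = 2) -> List[int]:
    """Return indices of images kept after dropping near-duplicates.

    An image is dropped when its hash lies within `max_distance` bits
    of the hash of an image that was already kept (hypothetical threshold).
    """
    kept: List[int] = []
    kept_hashes: List[int] = []
    for index, image in enumerate(images):
        h = average_hash(image)
        if all(hamming(h, other) > max_distance for other in kept_hashes):
            kept.append(index)
            kept_hashes.append(h)
    return kept


if __name__ == "__main__":
    a = [[10, 200], [220, 15]]
    b = [[12, 198], [221, 14]]   # slightly perturbed copy of a
    c = [[200, 10], [15, 220]]   # inverted brightness pattern
    print(deduplicate([a, b, c]))
```

In this toy input the second image is a slightly perturbed copy of the first, so it maps to the same hash and is dropped, while the third differs in every bit and is kept. A real pipeline would more likely rely on a dedicated image hashing library and a threshold tuned on the generated data.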
