
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

Zhenxing Zhang, Yaxiong Wang, Lechao Cheng, Zhun Zhong, Dan Guo, Meng Wang

TL;DR

ASAP addresses the DGM4 problem by focusing on fine-grained cross-modal semantic alignment between image and text. The approach combines Large Model-assisted Alignment (LMA) to generate auxiliary captions and explanations, Manipulation-Guided Cross Attention (MGCA) to direct model focus toward manipulated components, and Patch Manipulation Modeling (PMM) to provide local grounding priors; these components are integrated through a unified training loss that augments the standard DGM4 objectives. Empirical results on the HAMMER-derived DGM4 dataset show that ASAP achieves top performance across manipulation detection, manipulation type identification, image grounding, and text grounding, with substantial improvements over HAMMER baselines and related methods. The training-time alignment strategies rely on auxiliary texts and guidance masks, yet incur no inference-time overhead, highlighting practical benefits for robust, fine-grained multi-modal manipulation detection and grounding. Overall, ASAP advances cross-modal alignment as a central mechanism for improving DGM4, offering a scalable and effective framework for real-world media integrity analysis.

Abstract

We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4). Upon thorough examination, we observe that accurate fine-grained cross-modal semantic alignment between the image and text is vital for accurate manipulation detection and grounding. However, existing DGM4 methods pay little attention to cross-modal alignment, which limits further gains in manipulation detection accuracy. To remedy this issue, this work aims to advance semantic alignment learning to promote this task. In particular, we utilize off-the-shelf Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) to construct image-text pairs, especially for the manipulated instances. Subsequently, cross-modal alignment learning is performed to enhance the semantic alignment. Besides the explicit auxiliary clues, we further design a Manipulation-Guided Cross Attention (MGCA) to provide implicit guidance for augmenting manipulation perception. With the ground truth available during training, MGCA encourages the model to concentrate more on manipulated components while downplaying normal ones, enhancing the model's ability to capture manipulations. Extensive experiments are conducted on the DGM4 dataset, and the results demonstrate that our model surpasses the comparison methods by a clear margin.

Paper Structure

This paper contains 21 sections, 17 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Fine-grained understanding of multimodal media is one of the keys to detecting manipulated media. Capturing the misaligned components between the image and the text can effectively assist the DGM4 task.
  • Figure 2: Illustration of our proposed ASAP framework. We employ a Multimodal Large Language Model (MLLM) to generate captions and a Large Language Model (LLM) to produce explanation texts for social media image-text pairs. These, along with the image, are encoded to obtain feature representations. Our Large Model-assisted Alignment (LMA) module enhances cross-modal alignment, followed by two Multimodal Encoders with Manipulation-Guided Cross Attention (MGCA) to integrate features for task-specific representations. One encoder is vision-biased for image grounding, and the other is text-biased for text grounding. The combined features from both encoders are used for media authenticity detection and manipulation identification. The network is optimized using DGM losses and objectives from LMA and MGCA.
  • Figure 3: Illustration of the generation of image caption (left) and explanation text (right). The auxiliary texts can be effectively harvested via the off-the-shelf large models with the carefully crafted instructions.
  • Figure 4: Illustration of constructing the indicator mask. Given the bounding box of the manipulated region, the patches that overlap with the box are taken as positive samples, while the patches adjacent to the positive patches are taken as negative ones. The other patches are ignored.
  • Figure 5: Effect of MGCA and PMM Loss on Attention Map Visualization. The red rectangle represents the bounding box of the manipulated face, and the red text indicates the manipulated word. (a) and (b) show the attention visualization between the manipulated word and the image. (c) shows the attention visualization between the entire sentence and the image. (d) presents the model’s prediction compared to the Ground Truth.
  • ...and 1 more figures
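
The indicator-mask construction described in the Figure 4 caption can be sketched roughly as follows. This is only a minimal illustration, not the authors' implementation: the function name `build_indicator_mask`, the label values, the square image and patch grid, and the use of an 8-neighbourhood for "adjacent" are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical label values, chosen for this sketch only.
POSITIVE, NEGATIVE, IGNORE = 1, 0, -1


def build_indicator_mask(
    box: Tuple[int, int, int, int],  # (x1, y1, x2, y2) in pixels
    image_size: int,                 # assumes a square image
    patch_size: int,                 # assumes square, non-overlapping patches
) -> List[List[int]]:
    """Label each patch as positive, negative, or ignored."""
    x1, y1, x2, y2 = box
    n = image_size // patch_size
    mask = [[IGNORE] * n for _ in range(n)]

    # Positive: patches whose pixel extent overlaps the bounding box.
    for r in range(n):
        for c in range(n):
            px1, py1 = c * patch_size, r * patch_size
            px2, py2 = px1 + patch_size, py1 + patch_size
            if px1 < x2 and px2 > x1 and py1 < y2 and py2 > y1:
                mask[r][c] = POSITIVE

    # Negative: non-positive patches that touch a positive patch
    # (8-neighbourhood is an assumption; the caption only says "adjacent").
    for r in range(n):
        for c in range(n):
            if mask[r][c] == POSITIVE:
                continue
            neighbours = [
                (r + dr, c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            ]
            if any(
                0 <= nr < n and 0 <= nc < n and mask[nr][nc] == POSITIVE
                for nr, nc in neighbours
            ):
                mask[r][c] = NEGATIVE

    return mask


# Example: a 64x64 image split into 16x16 patches gives a 4x4 grid.
mask = build_indicator_mask((20, 20, 30, 30), image_size=64, patch_size=16)
for row in mask:
    print(row)
```

In this example the box falls entirely inside one patch, so that patch is positive, its surrounding ring is negative, and the remaining patches are ignored. Restricting negatives to the neighbours of the box presumably makes the model separate the manipulated region from its immediate surroundings, although the caption does not state the motivation.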