Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

Jiazhen Wang, Bin Liu, Changtao Miao, Zhiwei Zhao, Wanyi Zhuang, Qi Chu, Nenghai Yu

TL;DR

This paper constructs a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks and proposes an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details.

Abstract

AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the $\rm DGM^4$ dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.
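The query-based aggregation described for IMQ can be illustrated with a minimal sketch. It assumes the learnable queries act as the query side of a single-head scaled dot-product cross-attention over one modality's token embeddings; the abstract does not specify the attention form, so this is an illustration rather than the authors' implementation. The names `aggregate_with_queries`, `patch_tokens`, and `queries` are hypothetical, and the random vectors stand in for learned parameters and encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_with_queries(queries, tokens):
    """Single-head cross-attention: each query attends over all tokens.

    queries: list of Q vectors of length d (stand-ins for learnable queries).
    tokens:  list of N vectors of length d (e.g. patch or word embeddings).
    Returns (outputs, weights): Q pooled vectors and the Q x N attention map.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every token in the modality.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, tok)) for tok in tokens]
        attn = softmax(scores)
        # Weighted sum of tokens: a global summary conditioned on the query.
        pooled = [sum(a * tok[k] for a, tok in zip(attn, tokens)) for k in range(dim)]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


random.seed(0)
dim, num_tokens, num_queries = 8, 16, 2
# Assumption: random values replace real encoder outputs and trained queries.
patch_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

pooled, attn = aggregate_with_queries(queries, patch_tokens)
```

Each row of `attn` is a distribution over the tokens of a single modality, so every pooled vector summarizes global context from that modality only. In a trained model the queries would be updated by gradient descent together with the rest of the network.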

Paper Structure (15 sections, 9 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: The overall architecture of our framework. 1) Image and text features are extracted by the uni-modal encoders $E_i$ and $E_t$ and fused by the modality interaction modules $M_i$ and $M_t$. 2) The decoupled fine-grained classifiers $C_i$ and $C_t$ and the binary classifier $C_b$ take the image embedding $i_{cls}$, the text embedding $t_{cls}$, and the concatenated embeddings $\{i_{cls}, t_{cls}\}$ as inputs, respectively. 3) The image embeddings $i_{pat}$ and text embeddings $t_{tok}$ are fed separately into the implicit manipulation query module and the grounding heads.
  • Figure 2: Visualization of manipulation grounding results. Ground truths are shown in red and predictions in blue. The top three examples are from HAMMER (Shao et al., 2023), and the following three are from our model.
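
The input routing in step 2 of the Figure 1 caption can be sketched as follows. This is a minimal sketch that assumes each classifier is a single linear layer; the caption does not state the head architecture. The output sizes (two fine-grained logits per modality, one binary logit) and the helper names `linear` and `make_linear` are likewise assumptions made for illustration.

```python
import random


def linear(weights, bias, x):
    """Apply an affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def make_linear(in_dim, out_dim, rng):
    """Randomly initialised layer; a trained model would learn these values."""
    weights = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


rng = random.Random(0)
dim = 4
# Assumption: random vectors stand in for the [CLS]-style embeddings.
i_cls = [rng.gauss(0, 1) for _ in range(dim)]
t_cls = [rng.gauss(0, 1) for _ in range(dim)]

# Hypothetical head sizes, chosen only to make the routing concrete.
C_i = make_linear(dim, 2, rng)      # fine-grained image classifier
C_t = make_linear(dim, 2, rng)      # fine-grained text classifier
C_b = make_linear(2 * dim, 1, rng)  # binary real/fake classifier

image_logits = linear(*C_i, i_cls)          # sees only the image embedding
text_logits = linear(*C_t, t_cls)           # sees only the text embedding
binary_logit = linear(*C_b, i_cls + t_cls)  # sees the concatenation
```

Keeping $C_i$ and $C_t$ on separate inputs means each fine-grained head receives only its own modality's embedding, while $C_b$ is the only head that sees both.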