Bridging Semantic Logic Gaps: A Cognition Inspired Multimodal Boundary Preserving Network for Image Manipulation Localization

Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang

TL;DR

This work addresses the limited semantic understanding in image manipulation localization (IML) by introducing CMB-Net, which fuses image features with LLM-generated textual prompts. The architecture includes ITCAM, which down-weights text affected by LLM hallucinations; ITIM, which enables fine-grained cross-modal interaction; and RED, a decoder inspired by invertible neural networks that preserves boundary information. The model is trained with an overall loss $L_{all}$ that combines multi-level mask and boundary supervision, and it is reported to outperform most existing IML models on multiple benchmarks, including out-of-domain datasets, which indicates strong generalization. By bridging semantic gaps and preserving boundaries, CMB-Net targets robust manipulation localization in complex scenes; the authors provide code at https://github.com/vpsg-research/CMB-Net.

Abstract

Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation usually destroys the internal relationships between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition inspired multimodal boundary preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that erroneous text induced by LLM hallucination will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring that the textual information has a beneficial impact. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models. Our code is available at https://github.com/vpsg-research/CMB-Net.
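The abstract says RED is inspired by invertible neural networks but does not give its exact form. As a rough illustration of why an invertible mapping can keep information "without loss", the sketch below uses a generic additive coupling layer, a standard invertible building block. This is not the authors' decoder: the function names and `toy_transform` (a stand-in for a learned sub-network) are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def additive_coupling_forward(
    x1: Vector, x2: Vector, f: Callable[[Vector], Vector]
) -> Tuple[Vector, Vector]:
    """Forward pass: y1 = x1, y2 = x2 + f(x1)."""
    shift = f(x1)
    y2 = [b + s for b, s in zip(x2, shift)]
    return list(x1), y2


def additive_coupling_inverse(
    y1: Vector, y2: Vector, f: Callable[[Vector], Vector]
) -> Tuple[Vector, Vector]:
    """Inverse pass: x1 = y1, x2 = y2 - f(y1).

    f itself never has to be inverted, so it can be any function.
    """
    shift = f(y1)
    x2 = [b - s for b, s in zip(y2, shift)]
    return list(y1), x2


def toy_transform(v: Vector) -> Vector:
    """Hypothetical stand-in for a learned sub-network (assumption)."""
    return [2.0 * a + 1.0 for a in v]


if __name__ == "__main__":
    x1, x2 = [1.0, 2.0], [0.5, -1.5]
    y1, y2 = additive_coupling_forward(x1, x2, toy_transform)
    r1, r2 = additive_coupling_inverse(y1, y2, toy_transform)
    # The input halves are recovered from the output halves.
    print(r1, r2)
```

Because the inverse only subtracts what the forward pass added, the input is recovered from the output up to floating-point rounding, which is the property that makes such layers attractive when fine detail such as boundaries must survive decoding.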

Paper Structure

This paper contains 19 sections, 29 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Comparison with mainstream IML methods. We use LLMs to analyze potential manipulated regions in images and generate prompt-based textual information to enhance visual features. In addition, the text features are weighted by quantifying the ambiguity of images and texts, which solves the inaccurate localization caused by the hallucination problem of LLMs.
  • Figure 2: The overall architecture of CMB-Net. It uses PVTv2 [wang2022pvt] as the visual encoder and BERT [kenton2019bert] as the text encoder. Additionally, it includes three main modules: the image-text central ambiguity module (ITCAM), the image-text interaction module (ITIM), and the restoration edge decoder (RED). It is worth noting that RED consists of four decoder blocks (DB). Each DB contains two components: the edge-guided residual module (EGRM) and the edge refinement module (ERM).
  • Figure 3: The text generated by LLMs is not always reliable. In (I), we present four scenarios, where (a) represents the expected answer, while (b), (c), and (d) are ambiguous or incorrect answers. Furthermore, not all image and text information is useful. For instance, we consider the red words in the answer as having a significant impact on locating the manipulated area in the image. In (II), the ITCAM workflow is shown, where the ambiguity value of the image-text pair is computed by selecting central features. This method reduces interference from redundant data and enhances the ambiguity value’s representativeness.
  • Figure 4: The architecture of the Image-Text Interaction Module. $\alpha(\cdot)$, $\beta(\cdot)$, $\gamma(\cdot)$, $\delta(\cdot)$, and $\theta(\cdot)$ are all $1\times1$ convolutions.
  • Figure 5: The architecture of Decoder Block.
  • ...and 6 more figures
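
The ITCAM idea in the Figure 3 caption (compute an ambiguity value for an image-text pair from central features, then weight the text features accordingly) can be illustrated with a toy sketch. The captions do not define how central features are selected or how ambiguity is measured, so everything below is an assumption made for illustration: the "central feature" is taken as the mean vector, ambiguity is derived from cosine similarity, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity in [-1, 1]; returns 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def central_feature(features: List[Vector]) -> Vector:
    """Assumption: summarize a set of feature vectors by their mean."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def ambiguity(image_feats: List[Vector], text_feats: List[Vector]) -> float:
    """Assumed ambiguity score in [0, 1]: 0 = aligned, 1 = opposed."""
    sim = cosine_similarity(
        central_feature(image_feats), central_feature(text_feats)
    )
    return (1.0 - sim) / 2.0


def weight_text_features(
    image_feats: List[Vector], text_feats: List[Vector]
) -> List[Vector]:
    """Scale every text feature by (1 - ambiguity) so that text which
    disagrees with the image contributes less downstream."""
    w = 1.0 - ambiguity(image_feats, text_feats)
    return [[w * v for v in token] for token in text_feats]


if __name__ == "__main__":
    image = [[1.0, 0.0], [1.0, 0.0]]
    consistent_text = [[2.0, 0.0]]   # points the same way as the image
    unrelated_text = [[0.0, 3.0]]    # orthogonal to the image
    print(weight_text_features(image, consistent_text))
    print(weight_text_features(image, unrelated_text))
```

In this toy setup, text whose central feature aligns with the image passes through unchanged, while orthogonal text is halved and opposed text is suppressed entirely; a learned module would replace these fixed choices.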