Bridging Semantic Logic Gaps: A Cognition Inspired Multimodal Boundary Preserving Network for Image Manipulation Localization
Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang
TL;DR
This work addresses the limited semantic understanding in image manipulation localization (IML) by introducing CMB-Net, which fuses image features with LLM-generated textual prompts. The architecture has three components: an image-text central ambiguity module (ITCAM) that down-weights text affected by LLM hallucinations, an image-text interaction module (ITIM) that enables fine-grained cross-modal interaction, and a restoration edge decoder (RED), based on invertible neural networks, that preserves boundary information. The model is trained with an overall loss $L_{all}$ that combines multi-level mask and boundary supervision, and it achieves state-of-the-art results on multiple benchmarks, including out-of-domain datasets, which indicates strong generalization. By bridging semantic gaps and preserving boundaries, CMB-Net targets robust manipulation localization in complex scenes. The code is available at https://github.com/vpsg-research/CMB-Net.
Abstract
Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. The content semantics conveyed by real images usually conform to human cognitive laws, whereas image manipulation tends to destroy the internal relationships between content features, leaving semantic clues for IML. In this paper, we propose a cognition inspired multimodal boundary preserving network (CMB-Net). Specifically, CMB-Net uses large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information that compensates for the semantic relationships missing from the visual information. Because erroneous text induced by LLM hallucination can degrade IML accuracy, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, so that the textual information contributes only where it is beneficial. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) in which input and output features can be generated from each other, preserving boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models. Our code is available at https://github.com/vpsg-research/CMB-Net.
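The three mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cosine-similarity proxy for "ambiguity", the softmax-normalized correlation matrix, and the additive coupling layer are all assumptions chosen to show the general idea of (1) weighting text features by their agreement with image features, (2) fusing modalities through a correlation matrix, and (3) an invertible mapping whose input can be recovered from its output.

```python
import numpy as np

def ambiguity_weight(img_feat, txt_feat):
    """Hypothetical stand-in for ITCAM-style weighting.
    Maps the cosine similarity of two (d,) vectors to a weight in [0, 1],
    so text that disagrees with the image contributes less."""
    denom = np.linalg.norm(img_feat) * np.linalg.norm(txt_feat) + 1e-8
    cos = float(img_feat @ txt_feat) / denom
    return 0.5 * (cos + 1.0)

def correlation_interaction(vis, txt):
    """Hypothetical stand-in for ITIM-style interaction.
    vis: (n, d) visual tokens, txt: (m, d) text tokens.
    Builds an (n, m) correlation matrix, normalizes each row with softmax,
    and adds the text-weighted context to every visual token."""
    corr = vis @ txt.T / np.sqrt(vis.shape[1])
    corr = corr - corr.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(corr)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return vis + attn @ txt

def coupling_forward(x, shift_fn):
    """Additive coupling layer (a standard invertible building block).
    The first half passes through; the second half is shifted by a
    function of the first half."""
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([x1, x2 + shift_fn(x1)], axis=-1)

def coupling_inverse(y, shift_fn):
    """Inverse of coupling_forward: subtract the same shift."""
    y1, y2 = np.split(y, 2, axis=-1)
    return np.concatenate([y1, y2 - shift_fn(y1)], axis=-1)

# Toy usage with random features (shapes are illustrative only).
rng = np.random.default_rng(0)
vis = rng.normal(size=(6, 8))   # 6 visual tokens, 8 channels
txt = rng.normal(size=(3, 8))   # 3 text tokens, 8 channels

w = ambiguity_weight(vis.mean(axis=0), txt.mean(axis=0))
fused = correlation_interaction(vis, w * txt)

encoded = coupling_forward(fused, np.tanh)
restored = coupling_inverse(encoded, np.tanh)
print(fused.shape, np.allclose(fused, restored))
```

The coupling layer is the relevant point for boundary preservation: because the inverse only subtracts the shift that the forward pass added, the input features are recoverable up to floating-point error, which is the sense in which an invertible design avoids discarding information.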
