Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

Zizhao Chen, Ping Wei, Ziyang Ren, Huan Li, Xiangru Yin

Abstract

As multimodal misinformation grows more sophisticated, detecting and grounding it become crucial. However, current multimodal verification methods, which rely on passive holistic fusion, struggle with such sophisticated misinformation: due to 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF uses mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, we introduce a set of diverse parsers to extract fine-grained mask-label pairs. MaLSF achieves state-of-the-art performance on both the DGM4 benchmark and the multimodal fake news detection task. Extensive ablation studies and visualizations further verify its effectiveness and interpretability.
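
As a rough illustration of the BCV module's parallel query streams, the following PyTorch sketch runs Text-as-Query and Image-as-Query cross-attention side by side. The class name, feature dimension, and head count are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of the Bidirectional Cross-modal Verification (BCV) idea:
# two parallel cross-attention streams, Text-as-Query and Image-as-Query.
import torch
import torch.nn as nn

class BCVSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Text-as-Query: text tokens attend over image patches.
        self.text_as_query = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Image-as-Query: image patches attend over text tokens.
        self.image_as_query = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, img_feats: torch.Tensor):
        # text_feats: (B, T, dim) token features; img_feats: (B, P, dim) patch features.
        f_text, _ = self.text_as_query(text_feats, img_feats, img_feats)
        f_img, _ = self.image_as_query(img_feats, text_feats, text_feats)
        return f_text, f_img  # one verification feature per query direction

# Toy usage: the caption stream; each mask-label stream would reuse the same module.
bcv = BCVSketch()
l_cap = torch.randn(2, 16, 256)   # caption token features
v_img = torch.randn(2, 49, 256)   # image patch features
f_cap_img, f_img_cap = bcv(l_cap, v_img)
print(f_cap_img.shape, f_img_cap.shape)  # (2, 16, 256) and (2, 49, 256)
```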

Paper Structure

This paper contains 21 sections, 16 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Overview of the misinformation and our framework. (a) A manipulated media content. (b) Traditional methods. (c) Our MaLSF framework. We utilize the BCV and HSA modules to perform explicit verification and aggregation of local semantics.
  • Figure 2: Parsers for extracting mask-label pairs. The Open Vocabulary Parser generates mask-label pairs end-to-end. The Caption-Anchored Parser uses GLIP [li2022grounded] to obtain labels and corresponding bounding boxes, then SAM2 [ravi2024sam] to obtain refined masks (a minimal pipeline sketch follows this list).
  • Figure 3: Overall architecture of MaLSF. The figure takes N=3 mask-label pairs as an example. The mask-label pairs, combined with the original text and image, are encoded as text features $\mathbf{l}_{\text{cap}},\mathbf{l}_{1},\mathbf{l}_{2},\mathbf{l}_{3}$ and image features $\mathbf{V}_{\text{img}},\mathbf{V}_{1},\mathbf{V}_{2},\mathbf{V}_{3}$, respectively. These features are then encoded by BCV into the multi-granularity verification features $\mathbf{F}_{\text{cap}}^{\text{img}}, \mathbf{F}^{\text{cap}}_{\text{img}}, \mathbf{F}_{\text{cap}}^{i}, \mathbf{F}^{i}_{\text{img}}, i=1,2,3$. The HSA further aggregates the verification features into the features $s_{\text{cap}}, c_l, c_v, c_b, c_{\text{bbox}}$ used for the different tasks. TA and TS denote text manipulation types; FA and FS denote face manipulation types.
  • Figure 4: Architecture of Hierarchical Semantic Aggregation. All verification features pass through multi-label shallow fusion and multi-label deep fusion, and the fused features are fed to the linear heads (see the HSA sketch after this list).
  • Figure 5: Qualitative comparison of HAMMER++ and MaLSF. Red bounding boxes and text indicate model predictions; blue ones indicate ground truth. MaLSF is superior in both detection and grounding.
  • ...and 1 more figure
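
The Caption-Anchored Parser of Figure 2 chains a grounding model with a segmenter: phrases from the caption are grounded to boxes, and each box is refined into a pixel-level mask. The sketch below assumes a hypothetical ground_labels() wrapper standing in for a real GLIP inference call, and uses the SAM2ImagePredictor interface from the sam2 package; verify both against the versions you actually install.

```python
# Minimal sketch of the Caption-Anchored Parser pipeline (GLIP -> SAM2).
import numpy as np

def ground_labels(image: np.ndarray, caption: str):
    """Hypothetical GLIP wrapper: returns [(label, (x0, y0, x1, y1)), ...]."""
    raise NotImplementedError("replace with a real GLIP inference call")

def extract_mask_label_pairs(image: np.ndarray, caption: str):
    # Assumes the sam2 package (facebookresearch/sam2) with HF checkpoints.
    from sam2.sam2_image_predictor import SAM2ImagePredictor
    predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
    predictor.set_image(image)  # RGB array of shape (H, W, 3)
    pairs = []
    for label, box in ground_labels(image, caption):
        # Refine each grounded box into a segmentation mask with SAM2.
        masks, scores, _ = predictor.predict(box=np.array(box),
                                             multimask_output=False)
        pairs.append((masks[0], label))  # (H, W) mask paired with its phrase
    return pairs
```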
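The HSA sketch below illustrates the two-level design named in the Figure 4 caption, with mean pooling standing in for multi-label shallow fusion and a self-attention layer for multi-label deep fusion; both operators are our assumptions, since the captions do not specify them, and a single head replaces the paper's task-specific heads.

```python
# Minimal sketch of the Hierarchical Semantic Aggregation (HSA) idea.
import torch
import torch.nn as nn

class HSASketch(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        # Deep fusion: self-attention across the N mask-level features.
        self.deep = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # Linear head over the concatenated shallow and deep features.
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, mask_feats: torch.Tensor):
        # mask_feats: (B, N, dim), one verification feature per mask-label pair.
        shallow = mask_feats.mean(dim=1)          # multi-label shallow fusion
        deep = self.deep(mask_feats).mean(dim=1)  # multi-label deep fusion
        return self.head(torch.cat([shallow, deep], dim=-1))

logits = HSASketch()(torch.randn(2, 3, 256))  # N=3 pairs, as in Figure 3
```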