Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

Yiwei He, Zhenglin Huang, Haiquan Wen, Tianxiao Li, Yi Dong, Hao Fei, Baoyuan Wu, Guangliang Cheng

TL;DR

BiMi tackles the detection of bilingual multimodal misinformation in news by jointly localizing manipulated regions, assessing cross-modal and cross-lingual consistency, and generating faithful explanations. The work introduces BiMiBench, a large-scale benchmark of 104k samples pairing manipulated images with bilingual subtitles, and BiMi, a retrieval-augmented framework built on Gemma3. A three-stage training pipeline (domain alignment, instruction tuning, and GRPO-based reasoning optimization) yields state-of-the-art results on BiMiBench and transfers to MMFakeBench. Together these contributions advance interpretable, multilingual multimodal misinformation detection, with practical implications for real-world media scrutiny.

Abstract

The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.
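
The abstract names Group Relative Policy Optimization (GRPO) but does not spell out its mechanics. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: several explanations are sampled for the same input, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, the `eps` term, and the example reward values are all assumptions made for illustration; they are not taken from the paper, and BiMi's actual reward design may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (hypothetical helper).

    rewards: scalar rewards for several responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero when
         every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the paper's choice is not stated
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled explanations of one news sample.
example_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(example_rewards)
```

Responses scoring above their group's mean get a positive advantage and those below get a negative one, so no separate learned value model is needed to provide a baseline.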

Paper Structure

This paper contains 31 sections, 5 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Comparison of traditional vs. bilingual misinformation detection tasks. Traditional tasks focus on visual-text consistency with limited outputs (left). Our setting uses tampered images and bilingual subtitles, enabling richer outputs including region localization, cross-modal consistency, and explanation (right). Red indicates error, green indicates correctness. Best viewed in color.
  • Figure 2: Image quality comparison across datasets using the perceptual evaluation method of chen2024evaluating. BiMiBench achieves higher visual quality across dimensions than prior datasets, indicating closer alignment with real-world social media imagery.
  • Figure 3: The data generation workflow used in constructing the BiMiBench benchmark.
  • Figure 4: The overview of the training strategy. Three stages: domain alignment on news data, instruction tuning with task-specific prompts, and GRPO optimization with structured rewards.
  • Figure 5: Comparison of explanations from InternVL3 (middle) and our model (right). Top left: input; bottom left: original sample. Some responses are truncated due to space constraints.
  • ...and 12 more figures