MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

Longzheng Wang, Xiaohan Xu, Lei Zhang, Jiarui Lu, Yongxiu Xu, Hongbo Xu, Minghao Tang, Chuang Zhang

TL;DR

MMIDR tackles the challenge of interpretable multimodal misinformation detection by converting image-text misinformation into instruction-following prompts, eliciting rationales from a teacher LLM, and distilling these insights into open-source LLMs via LoRA. The approach combines a visual-information processing pipeline (OCR and image captioning) with evidence retrieval to support reasoning, and it demonstrates strong detection performance along with reasoning capabilities on the $\text{MR}^{2}_{llm}$ dataset. While distilled students closely approach teacher performance, they still lag in faithfully reproducing the full reasoning, indicating areas for improving fidelity. Overall, MMIDR offers a cost-effective, scalable path to accessible, rationale-rich multimodal misinformation detectors.

Abstract

Automatic detection of multimodal misinformation has gained widespread attention recently. However, the potential of powerful Large Language Models (LLMs) for multimodal misinformation detection remains underexplored. Moreover, how to teach LLMs to interpret multimodal misinformation in a cost-effective and accessible way is still an open question. To address this, we propose MMIDR, a framework designed to teach LLMs to provide fluent, high-quality textual explanations of their decisions about multimodal misinformation. To convert multimodal misinformation into an appropriate instruction-following format, we present a data augmentation perspective and pipeline. This pipeline consists of a visual information processing module and an evidence retrieval module. Subsequently, we prompt proprietary LLMs with the processed content to extract rationales for interpreting the authenticity of multimodal misinformation. Furthermore, we design an efficient knowledge distillation approach to distill the capability of proprietary LLMs in explaining multimodal misinformation into open-source LLMs. To explore several research questions regarding the performance of LLMs in multimodal misinformation detection tasks, we construct an instruction-following multimodal misinformation dataset and conduct comprehensive experiments. The experimental findings reveal that MMIDR exhibits sufficient detection performance and can provide compelling rationales to support its assessments.
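
The conversion into an instruction-following format can be pictured with a small sketch. This is not the authors' pipeline or template: the `ProcessedPost` container, its field names, the `build_instruction` helper, and the prompt wording are all hypothetical, and the example strings are invented placeholders. It only illustrates how OCR text, an image caption, and retrieved evidence might be flattened into one textual prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessedPost:
    """Hypothetical container for one image-text pair after preprocessing."""
    text: str                       # original post text
    ocr_text: str = ""              # text recognized inside the image (OCR)
    caption: str = ""               # generated image caption
    evidence: List[str] = field(default_factory=list)  # retrieved snippets


def build_instruction(post: ProcessedPost, max_evidence: int = 3) -> str:
    """Flatten a processed post into a single instruction-style prompt.

    The wording below is an assumption made for illustration only.
    """
    lines = [
        "Judge whether the following post is misinformation and explain why.",
        f"Post text: {post.text}",
        f"Image caption: {post.caption or 'N/A'}",
        f"Text in image (OCR): {post.ocr_text or 'N/A'}",
    ]
    # Keep only the first max_evidence snippets to bound the prompt length.
    for i, snippet in enumerate(post.evidence[:max_evidence], start=1):
        lines.append(f"Evidence {i}: {snippet}")
    lines.append("Answer with a label followed by a rationale.")
    return "\n".join(lines)


# Invented placeholder inputs, not taken from any dataset.
example = ProcessedPost(
    text="placeholder post text",
    caption="placeholder caption",
    evidence=["snippet A", "snippet B", "snippet C"],
)
prompt = build_instruction(example, max_evidence=2)
print(prompt)
```

Capping the number of evidence snippets is one simple way to trade prompt length against the amount of supporting context given to the model.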

Paper Structure (20 sections, 3 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Model architecture overview of MMIDR. Given multimodal information, we first convert it into an appropriate instruction-following format. We then use the labeling template to prompt the teacher LLM for rationales that explain the authenticity labels assigned to the provided multimodal information. Finally, we employ LoRA (Hu et al., 2021) to train the student LLM on the combination of the original multimodal information and the rationales. $I(I+V)$ denotes that the input to the student LLM is either the textual instruction alone ($I$) or the textual instruction together with the image ($I+V$), depending on whether the student is multimodal.
  • Figure 2: The length distribution of training data composed of different numbers of evidence.
  • Figure 3: The classification accuracy of the teacher model using different numbers of evidence.
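
Figure 1 mentions LoRA for training the student. As a reminder of what that technique does, here is a minimal pure-Python sketch of the generic LoRA forward pass, in which a frozen weight matrix `w0` is adjusted by a scaled low-rank product `b @ a`. It is not the paper's training code; the helper names `matmul` and `lora_forward` and the toy matrices are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_forward(x: List[float], w0: Matrix, a: Matrix, b: Matrix,
                 alpha: float) -> List[float]:
    """Compute (w0 + (alpha / r) * b @ a) applied to x.

    w0 is the frozen weight (d_out x d_in); a is (r x d_in) and b is
    (d_out x r). Only a and b would be trained in LoRA.
    """
    r = len(a)                      # rank of the low-rank update
    delta = matmul(b, a)            # d_out x d_in update matrix
    scale = alpha / r
    return [
        sum((w0[i][j] + scale * delta[i][j]) * x[j] for j in range(len(x)))
        for i in range(len(w0))
    ]


# Toy values chosen for illustration only.
w0 = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 3.0]
a = [[0.5, -0.5]]                   # rank r = 1

# With b initialized to zeros, the adapted layer matches the frozen layer.
y_init = lora_forward(x, w0, a, [[0.0], [0.0]], alpha=2.0)

# After b moves away from zero, the output changes.
y_tuned = lora_forward(x, w0, a, [[1.0], [0.0]], alpha=2.0)
print(y_init, y_tuned)
```

Because only `a` and `b` are updated, the number of trainable parameters scales with the rank `r` rather than with the full size of `w0`, which is what makes this kind of fine-tuning inexpensive.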