SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection

Peng Qi, Zehong Yan, Wynne Hsu, Mong Li Lee

TL;DR

SNIFFER tackles out-of-context misinformation by pairing a two-stage instruction-tuned multimodal LLM with retrieval-based reasoning to detect inconsistencies between image and caption while explaining the rationale. By fine-tuning InstructBLIP's Q-Former on news-domain data and OOC-specific instructions generated with GPT-4, and by integrating external evidence via tools, the model achieves superior detection performance and generates persuasive explanations. Empirical results on NewsCLIPpings show state-of-the-art accuracy, data-efficient learning, and robust explainability, with strong human validation. The approach also demonstrates cross-dataset generalization and favorable comparisons to GPT-4V, underscoring the practical value of task-specific, explanation-enabled MLLMs for misinformation debunking.

Abstract

Misinformation is a prevalent societal issue due to its potential high risks. Out-of-context (OOC) misinformation, where authentic images are repurposed with false text, is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments, which are essential for debunking misinformation. While Multimodal Large Language Models (MLLMs) have rich knowledge and innate capability for visual reasoning and explanation generation, they still lack sophistication in understanding and discovering subtle cross-modal differences. In this paper, we introduce SNIFFER, a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation. SNIFFER employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities, and the second stage leverages OOC-specific instruction data generated by language-only GPT-4 to fine-tune the model's discriminatory powers. Enhanced by external tools and retrieval, SNIFFER not only detects inconsistencies between text and image but also utilizes external knowledge for contextual verification. Our experiments show that SNIFFER surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. SNIFFER also provides accurate and persuasive explanations, as validated by quantitative and human evaluations.

Paper Structure

This paper contains 19 sections, 1 equation, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Comparison between the proposed Sniffer and other detectors. In this out-of-context misinformation example, the individual in the image is Harry Thomas Jr, which contradicts the caption. Existing detectors often give a judgment without explanation. While InstructBLIP and GPT-4V correctly identify the inconsistent news element (i.e., the person) in the image-text pair, they mistakenly associate the person in the image with a different individual mentioned in the caption. In contrast, Sniffer analyzes both the consistency of the image-text content and the claim-evidence relevance, and accurately identifies the person in the image as Harry Thomas Jr, thereby providing a precise and persuasive explanation.
  • Figure 2: Architecture of the proposed framework Sniffer. For a given image-text pair, Sniffer conducts a two-pronged analysis: (1) it checks the consistency of the image and text content (internal checking), and (2) it examines the relevance between the context of the retrieved image and the provided text (external checking). The outcomes of both verification processes are then considered by Sniffer to arrive at a final judgment and explanation.
  • Figure 3: Sniffer is initialized from the general-domain InstructBLIP and then further trained in sequence, first to adapt it to the news domain and then to the OOC misinformation detection task.
  • Figure 4: Process of OOC instruction generation.
  • Figure 5: Response ratio.
  • ...and 5 more figures
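
The two-pronged analysis described in the Figure 2 caption (internal checking, external checking, then a combined judgment with an explanation) can be illustrated with a minimal control-flow sketch. Everything below is a hypothetical stand-in rather than the authors' implementation: the names `CheckResult`, `toy_internal_check`, `toy_external_check`, `compose`, and `detect` are invented for illustration, the string-matching heuristics replace what the paper does with a multimodal LLM and retrieval tools, and the "flag if either check flags" rule in `compose` is an assumption, not the paper's stated decision rule.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CheckResult:
    flags_ooc: bool   # True if this check considers the pair out-of-context
    rationale: str    # short natural-language reason for the decision


def toy_internal_check(image_entities: set[str], caption: str) -> CheckResult:
    """Hypothetical stand-in for internal checking (image vs. caption content).

    The real system uses an MLLM on the image; here the image is reduced to a
    set of entity names assumed to have been recognized in it.
    """
    mentioned = sorted(e for e in image_entities if e.lower() in caption.lower())
    if mentioned:
        return CheckResult(False, f"Caption names {mentioned}, who appear in the image.")
    return CheckResult(True, "No entity recognized in the image is named in the caption.")


def toy_external_check(retrieved_context: str, caption: str) -> CheckResult:
    """Hypothetical stand-in for external checking (retrieved context vs. caption).

    Uses plain word overlap; the 0.5 threshold is an arbitrary illustrative value.
    """
    context_words = set(retrieved_context.lower().split())
    caption_words = set(caption.lower().split())
    overlap = len(context_words & caption_words) / max(len(caption_words), 1)
    if overlap >= 0.5:
        return CheckResult(False, "Retrieved context is relevant to the caption.")
    return CheckResult(True, "Retrieved context does not support the caption.")


def compose(internal: CheckResult, external: CheckResult) -> CheckResult:
    """Combine both outcomes into one judgment plus explanation (assumed rule)."""
    flags = internal.flags_ooc or external.flags_ooc
    rationale = f"Internal: {internal.rationale} External: {external.rationale}"
    return CheckResult(flags, rationale)


def detect(image_entities: set[str], retrieved_context: str, caption: str) -> CheckResult:
    """Run both checks on one image-text pair and return the combined result."""
    internal = toy_internal_check(image_entities, caption)
    external = toy_external_check(retrieved_context, caption)
    return compose(internal, external)
```

As a usage illustration with made-up inputs, calling `detect({"Harry Thomas Jr"}, "Harry Thomas Jr leaves the courthouse", "Senator John Doe speaks at a rally")` flags the pair, because neither check finds support for the caption; passing the retrieved context itself as the caption does not. The point of the sketch is only the structure: two independent checks whose rationales are both carried into the final explanation.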