"Image, Tell me your story!" Predicting the original meta-context of visual misinformation

Jonathan Tonglet, Marie-Francine Moens, Iryna Gurevych

TL;DR

This work introduces automated image contextualization, the task of grounding an image in its original meta-context using the 5 Pillars framework (Provenance, Source, Date, Location, Motivation). It provides 5Pils, the first dataset of 1,676 fact-checked images with pillar-based question-answer pairs, and a first baseline that combines the image content with open-web evidence retrieved via reverse image search (RIS), evaluated with pillar-specific metrics. The baseline performs promisingly on some pillars (notably Location and Motivation), and the results support the utility of multimodal inputs and CLIP-guided evidence ranking. Other pillars (especially Source and Date) remain challenging, which points to the need for better retrieval, multi-step reasoning, and evidence reliability assessment. The work lays a foundation for automated tools that support human fact-checkers by grounding images in their original context and supplying material for prebunking, with the aim of improving the quality and timeliness of debunking.

Abstract

To assist human fact-checkers, researchers have developed automated approaches for visual misinformation detection. These methods assign veracity scores by identifying inconsistencies between the image and its caption, or by detecting forgeries in the image. However, they neglect a crucial point of the human fact-checking process: identifying the original meta-context of the image. By explaining what is actually true about the image, fact-checkers can better detect misinformation, focus their efforts on check-worthy visual content, engage in counter-messaging before misinformation spreads widely, and make their explanation more convincing. Here, we fill this gap by introducing the task of automated image contextualization. We create 5Pils, a dataset of 1,676 fact-checked images with question-answer pairs about their original meta-context. Annotations are based on the 5 Pillars fact-checking framework. We implement a first baseline that grounds the image in its original meta-context using the content of the image and textual evidence retrieved from the open web. Our experiments show promising results while highlighting several open challenges in retrieval and reasoning. We make our code and data publicly available.

Paper Structure (30 sections, 2 equations, 22 figures, 6 tables)

Figures (22)

  • Figure 1: An out-of-context image with a false caption. The original meta-context of the image is established by answering the questions of the 5 Pillars framework.
  • Figure 2: An annotated image of the 5Pils dataset. Above: the FC article. Below: the extracted 5 Pillars answers.
  • Figure 3: Left: Images with answers per pillar (%). Right: Images with n answered pillars (n=1,..,5) (%).
  • Figure 4: Our baseline for image contextualization. The retriever is a reverse image search (RIS) engine that searches for web pages containing previous versions of the image. The top-$k$ evidence items and the image are provided as input to a QA model, which answers the 5 Pillars questions. Manipulated images go through the additional step of identifying the original unaltered image.
  • Figure 5: Images from the 5Pils test set with zero-shot multimodal GPT4 predictions. The middle column shows excerpts of the evidence texts with relevant snippets. The right column shows correct, partially correct, and wrong predictions. GT is the ground truth answer.
  • ...and 17 more figures
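
The evidence-selection step described in the Figure 4 caption (keep the top $k$ evidence items, then pass them with the image to a QA model) can be illustrated with a minimal sketch. It assumes that each evidence text and the image have already been embedded into a shared vector space, for example by a CLIP-style encoder, and that items are ranked by cosine similarity to the image. The scoring rule, the function names `cosine` and `top_k_evidence`, and the toy vectors are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

# The five pillars named in the summary above.
PILLARS = ("Provenance", "Source", "Date", "Location", "Motivation")


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_evidence(
    image_emb: Sequence[float],
    evidence: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[str]:
    """Return the k evidence texts whose embeddings are closest to the image.

    `evidence` is a list of (text, embedding) pairs. The embeddings are
    assumed to be precomputed; this sketch does not call any real encoder.
    """
    ranked = sorted(
        evidence,
        key=lambda item: cosine(image_emb, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real image and text embeddings.
    image_emb = [1.0, 0.0]
    evidence = [
        ("unrelated page", [0.0, 1.0]),
        ("page showing an earlier copy of the image", [1.0, 0.1]),
        ("loosely related page", [0.5, 0.5]),
    ]
    selected = top_k_evidence(image_emb, evidence, k=2)
    # The selected texts would then accompany the image in one query per pillar.
    for pillar in PILLARS:
        print(pillar, "<-", selected)
```

In a full pipeline, the selected texts and the image would be given to a multimodal QA model once per pillar. How the prompt is phrased and how $k$ is chosen are not specified in this summary.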