COVE: COntext and VEracity prediction for out-of-context images

Jonathan Tonglet, Gabriel Thiem, Iryna Gurevych

TL;DR

COVE tackles multimodal misinformation from out-of-context images by first predicting a comprehensive image context using diverse web, Wikipedia, and automated caption evidence, and then predicting caption veracity from that context. The method demonstrates strong context prediction across seven items and delivers veracity performance competitive with or superior to baselines, especially on real-world data, while revealing that the predicted context can serve as a reusable artifact for human verification. The work shows that sequentially combining context grounding and veracity checking improves robustness against OOC misinformation and provides a practical, interpretable tool for fact-checkers, with a discussion of limitations and ethical considerations.

Abstract

Images taken out of their context are the most prevalent form of multimodal misinformation. Debunking them requires (1) providing the true context of the image and (2) checking the veracity of the image's caption. However, existing automated fact-checking methods fail to tackle both objectives explicitly. In this work, we introduce COVE, a new method that first predicts the true COntext of the image and then uses it to predict the VEracity of the caption. COVE beats the SOTA context prediction model on all context items, often by more than five percentage points. It is competitive with the best veracity prediction models on synthetic data and outperforms them on real-world data, showing that it is beneficial to combine the two tasks sequentially. Finally, we conduct a human study that reveals that the predicted context is a reusable and interpretable artifact for verifying new out-of-context captions for the same image. Our code and data are made available.
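The sequential design described in the abstract (context first, veracity second) can be illustrated with a toy sketch. This is not the authors' implementation. The `Context` container, the function names, the item keys (`location`, `date`), and the exact-match comparison are all assumptions made for illustration; the paper's method predicts context items from retrieved evidence with a model, not by string matching.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    # Hypothetical container for predicted context items (names are illustrative).
    items: dict = field(default_factory=dict)


def predict_context(evidence):
    """Step 1 (toy): merge context items from evidence; first non-empty value wins."""
    ctx = Context()
    for ev in evidence:
        for key, value in ev.items():
            if value and key not in ctx.items:
                ctx.items[key] = value
    return ctx


def predict_veracity(caption_claims, ctx):
    """Step 2 (toy): flag the caption if any claimed item contradicts the context."""
    for key, claimed in caption_claims.items():
        predicted = ctx.items.get(key)
        if predicted is not None and predicted.lower() != claimed.lower():
            return "out-of-context"
    return "accurate"


# The context is predicted once from the evidence ...
evidence = [
    {"location": "Paris", "date": ""},
    {"location": "Lyon", "date": "2019"},
]
ctx = predict_context(evidence)

# ... and then reused to check several captions for the same image.
verdict_a = predict_veracity({"location": "Berlin"}, ctx)
verdict_b = predict_veracity({"location": "paris", "date": "2019"}, ctx)
```

Because the context is computed independently of any caption, the same `ctx` object can be checked against new captions for the same image, which mirrors the reusability claim made in the abstract.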

Paper Structure

This paper contains 35 sections, 16 figures, and 8 tables.

Figures (16)

  • Figure 1: The two steps of COVE: (1) Generating the true context of the image. (2) Predicting the veracity of a caption by comparing it with the generated context.
  • Figure 2: The architecture of COVE consists of six steps. The first three are performed in parallel and consist of retrieving evidence. Step 4 predicts the context items in a QA setting. Step 5 updates missing items based on the existing ones and Wikipedia knowledge. Step 6 predicts the veracity of the caption based on the predicted context.
  • Figure 3: Wikipedia entities collection. The candidate set is composed of the entities in the caption and those that are most similar to the image. Candidates are retained if the similarity between the image and their name or their Wikipedia images passes a threshold.
  • Figure 4: Knowledge gap completion. Questions are generated based on the predicted context and answered with Wikipedia passages. If the answers are relevant, the context is updated.
  • Figure 5: Change in veracity prediction before and after seeing SNIFFER (top) or COVE (bottom) artifacts, for accurate (left) and OOC (right) captions.
  • ...and 11 more figures
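
The candidate filtering described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the embedding vectors are hand-written toy values, the dictionary keys (`name_vec`, `wiki_image_vecs`) are hypothetical, and the threshold of 0.8 is an arbitrary placeholder, since the caption does not give the similarity model or the threshold value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_entities(image_vec, candidates, threshold=0.8):
    """Keep a candidate if its name OR any of its Wikipedia images is similar
    enough to the query image. The threshold value is an assumption."""
    kept = []
    for cand in candidates:
        sims = [cosine(image_vec, cand["name_vec"])]
        sims += [cosine(image_vec, v) for v in cand.get("wiki_image_vecs", [])]
        if max(sims) >= threshold:
            kept.append(cand["name"])
    return kept


# Toy 2-d embeddings standing in for real image/text embeddings.
image_vec = [1.0, 0.0]
candidates = [
    {"name": "A", "name_vec": [1.0, 0.0]},                                     # name matches
    {"name": "B", "name_vec": [0.0, 1.0], "wiki_image_vecs": [[0.9, 0.1]]},    # image matches
    {"name": "C", "name_vec": [0.0, 1.0], "wiki_image_vecs": [[0.0, 1.0]]},    # neither matches
]
retained = filter_entities(image_vec, candidates)
```

Taking the maximum over the name similarity and the image similarities implements the "name or Wikipedia images" condition from the caption: a candidate survives if either signal passes the threshold.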