Verifying Cross-modal Entity Consistency in News using Vision-language Models

Sahar Tahmasebi, David Ernst, Eric Müller-Budack, Ralph Ewerth

TL;DR

This work tackles disinformation in multimodal news by verifying cross-modal consistency at the level of individual entities (persons, locations, events) rather than the entire document. It introduces LVLM4CEC, a zero-shot framework that uses large vision-language models (LVLMs), prompt-based questioning, and web-sourced evidence images to assess per-entity coherence between text and images. The authors extend three multimodal news datasets with manual ground-truth annotations for entity verification. They show that LVLMs achieve strong zero-shot performance, with notable gains when evidence images are used, and that the approach outperforms a CNN-based document-verification baseline for location and event verification. The datasets and code are public, which supports practical use in automated fact-checking; open directions include improving location-specific verification and extending the approach to additional entity types such as times and organizations.

Abstract

The web has become a crucial source of information, but it is also used to spread disinformation, often conveyed through multiple modalities like images and text. The identification of inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical to detect disinformation. Previous works either identify out-of-context disinformation by assessing the consistency of images to the whole document, neglecting relations of individual entities, or focus on generic entities that are not relevant to news. So far, only a few approaches have addressed the task of validating entity consistency between images and text in news. However, the potential of large vision-language models (LVLMs) has not been explored yet. In this paper, we propose an LVLM-based framework for verifying Cross-modal Entity Consistency (LVLM4CEC), to assess whether persons, locations, and events in news articles are consistent across both modalities. We suggest effective prompting strategies for LVLMs for entity verification that leverage reference images crawled from the web. Moreover, we extend three existing datasets for the task of entity verification in news by providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, with improved accuracy in identifying persons and events when using evidence images. Moreover, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at https://github.com/TIBHannover/LVLM4CEC.

Paper Structure (24 sections, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Example of cross-modal entity verification. The image is replaced with a similar one due to license restrictions. The original image is linked on GitHub.
  • Figure 2: Pipeline for entity consistency verification with (bottom) and without (top) using evidence images. The model assesses whether or not an entity $e \in \mathbb{E}$ is visible, i.e., shares a cross-modal relation, in the news image $I$. Green indicates valid relations, while red denotes invalid relations.
  • Figure 3: Entity verification with and without image evidence across models. Green text boxes indicate correct predictions; red text boxes indicate incorrect ones. Green borders show visible entities; red borders show invisible ones. As the baseline only outputs a Cross-modal Similarity (CMS) score, we classify CMS values above 0.65 as the 'Yes' class and values below 0.65 as the 'No' class. Images are replaced with similar ones due to license restrictions. Original images are linked on GitHub.
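
The thresholding described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `cms_to_label` and the example scores are hypothetical, and the handling of a score exactly equal to 0.65 is an assumption, since the caption does not specify that boundary case.

```python
# Threshold stated in the Figure 3 caption for the baseline's CMS score.
CMS_THRESHOLD = 0.65


def cms_to_label(cms: float, threshold: float = CMS_THRESHOLD) -> str:
    """Map a baseline similarity score to a 'Yes'/'No' verification label."""
    # Assumption: a score exactly at the threshold is treated as 'No';
    # the caption only covers values above and below 0.65.
    return "Yes" if cms > threshold else "No"


# Hypothetical scores for illustration only (not results from the paper).
scores = {"person": 0.81, "location": 0.40, "event": 0.65}
labels = {entity: cms_to_label(score) for entity, score in scores.items()}
```

Mapping the continuous score to a binary label in this way makes the baseline's output directly comparable with the 'Yes'/'No' answers produced by the LVLMs.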