Large Language Models and Provenance Metadata for Determining the Relevance of Images and Videos in News Stories
Tomas Peterka, Matyas Bohacek
TL;DR
The paper addresses the challenge of detecting misinformation that exploits multimodal media by incorporating provenance metadata into a large language model (LLM) framework. The framework ingests a news article, the captions of attached media, and provenance metadata to assess whether the media is relevant to the story and whether it has been tampered with; it outputs location relevance, tampering status, and an overall relevance verdict. A prototype is implemented using Newspaper4k for article scraping, the C2PA standard for provenance, and the Phi-3 LLM with a Gradio-based web interface, and is released as open source under the MIT license. The work acknowledges limitations such as LLM hallucinations, sparse adoption of provenance metadata in practice, the lack of dedicated evaluation datasets, and potential biases, and outlines directions for future evaluation and dataset creation.
Abstract
The most effective misinformation campaigns are multimodal, often combining text with images and videos taken out of context -- or fabricating them entirely -- to support a given narrative. Contemporary methods for detecting misinformation, whether in deepfakes or text articles, often miss the interplay between multiple modalities. Built around a large language model, the system proposed in this paper addresses these challenges. It analyzes both the article's text and the provenance metadata of included images and videos to determine whether they are relevant. We open-source the system prototype and interactive web interface.
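To make the data flow concrete, the following is a minimal sketch of how the three inputs (article text, media caption, provenance metadata) might be combined into a single LLM prompt and how the three outputs (location relevance, tampering status, overall verdict) might be parsed from the reply. It is not the authors' implementation: the names `MediaVerdict`, `build_prompt`, `parse_verdict`, and `assess`, the prompt wording, the JSON reply format, and the provenance field names are all assumptions made for illustration, and the model is represented by an arbitrary callable rather than an actual Phi-3 call.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class MediaVerdict:
    """The three outputs named in the summary (field names are hypothetical)."""
    location_relevant: bool
    tampered: bool
    relevant: bool


def build_prompt(article_text: str, caption: str, provenance: dict) -> str:
    """Combine the three inputs into one prompt (wording is an assumption)."""
    return (
        "You are assessing whether a media item belongs in a news story.\n\n"
        f"ARTICLE:\n{article_text}\n\n"
        f"CAPTION:\n{caption}\n\n"
        "PROVENANCE METADATA:\n"
        f"{json.dumps(provenance, indent=2, sort_keys=True)}\n\n"
        "Reply with a JSON object with boolean keys "
        '"location_relevant", "tampered", and "relevant".'
    )


def parse_verdict(reply: str) -> MediaVerdict:
    """Parse the JSON object embedded in a free-text model reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(reply[start:end + 1])
    return MediaVerdict(
        location_relevant=bool(data["location_relevant"]),
        tampered=bool(data["tampered"]),
        relevant=bool(data["relevant"]),
    )


def assess(article_text: str, caption: str, provenance: dict,
           llm: Callable[[str], str]) -> MediaVerdict:
    """Run one assessment; `llm` is any function mapping a prompt to a reply."""
    return parse_verdict(llm(build_prompt(article_text, caption, provenance)))
```

In use, `llm` would wrap whatever model is available, and `provenance` would hold fields extracted from the media's provenance record; the keys in the example below (`location`, `created`, `edits`) are placeholders, not the C2PA manifest schema.

```python
verdict = assess(
    article_text="Floods hit Prague on Monday.",
    caption="Flooded street in Prague",
    provenance={"location": "Prague, CZ", "created": "2024-06-03", "edits": []},
    llm=lambda prompt: '{"location_relevant": true, "tampered": false, "relevant": true}',
)
```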
