V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

Seyed Mahed Mousavi, Christian Moiola, Massimo Rizzoli, Simone Alghisi, Giuseppe Riccardi

Abstract

Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Moreover, existing alignment approaches fail to consistently update the models' knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.

Paper Structure (21 sections, 3 figures, 12 tables)

Figures (3)

  • Figure 1: An example of multimodal querying of VLMs for time-sensitive factual knowledge. Given a visual stimulus, the VLM first misidentifies the entity and retrieves an incorrect fact. Following the clarification turn about the entity, the model generates an outdated answer. In the final turn, the entity is explicitly stated in the text, and the correct fact is finally returned. This example highlights key issues investigated in this work: the prevalence of outdated knowledge in VLMs and the performance gap between visual and textual stimuli.
  • Figure 2: Temporal distribution of the model responses based on Wikidata. For each VLM, we map its correct and outdated responses (e.g., "The CEO of Apple is Steve Jobs") to the time interval during which the corresponding attribute was valid (e.g., "1997-2011"). By aggregating these intervals using a boxplot, we can approximate the state of the world encoded in the model's parameters. For example, results show that, while Qwen2-VL responses range between 2004 and 2023 (in line with its reported cutoff), most of them are concentrated between 2013 and 2019.
  • Figure 3: Mechanistic analysis of Qwen2-VL illustrating how editing methods modify the probability of certain attributes (y-axis) across different layers (x-axis) when the model is asked about an Image-Question pair. In the case of a successful update (top), WISE and GRACE affect different layers: WISE primarily edits the final layer, whereas GRACE modifies a broader range. In contrast, neither method successfully updates the basketball team of Paul George (bottom), suggesting that this knowledge may be more deeply embedded or that other stored facts may interfere.
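
The aggregation described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the lookup table `VALIDITY`, the helper names `years_covered` and `box_summary`, and the choice to expand each interval into individual years before computing quartiles are all assumptions made for illustration. The only data point taken from the caption is the Steve Jobs interval (1997-2011); a real pipeline would obtain validity intervals from Wikidata.

```python
from statistics import quantiles

# Hypothetical lookup: (entity, attribute, value) -> (start_year, end_year).
# Only the example from the Figure 2 caption is included here.
VALIDITY = {
    ("Apple", "CEO", "Steve Jobs"): (1997, 2011),
}


def years_covered(responses, validity):
    """Expand each recognized response into the years in which it was valid.

    Responses without a known validity interval are skipped.
    """
    years = []
    for response in responses:
        if response in validity:
            start, end = validity[response]
            years.extend(range(start, end + 1))
    return years


def box_summary(years):
    """Return the five numbers a boxplot would display for the given years."""
    q1, median, q3 = quantiles(years, n=4, method="inclusive")
    return {
        "min": min(years),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(years),
    }


# Toy usage: one mapped response and one response with no known interval.
model_responses = [
    ("Apple", "CEO", "Steve Jobs"),
    ("Apple", "CEO", "unknown answer"),
]
summary = box_summary(years_covered(model_responses, VALIDITY))
print(summary)
```

Expanding intervals into years weights long-lived facts more heavily than short-lived ones; using interval midpoints instead would be an equally plausible reading of the caption.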