DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs

Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi

TL;DR

DyKnow tackles the problem of time-sensitive knowledge in large language models by introducing a dynamic benchmarking framework that validates model outputs against Wikidata's current values and historical validity intervals. It systematically evaluates 24 LLMs on 130 time-sensitive facts, revealing notable outdatedness and prompt-induced inconsistency across models. The study also tests four knowledge editing methods, finding model-dependent improvements with limited scalability, underscoring the need for robust, dynamic benchmarking and updating mechanisms. Overall, the work demonstrates the necessity of dynamic, real-time validation for trustworthy LLM knowledge and motivates community-driven expansion of the DyKnow benchmark.

Abstract

LLMs acquire knowledge from massive data snapshots collected at different timestamps. Their knowledge is then commonly evaluated using static benchmarks. However, factual knowledge is generally subject to time-sensitive changes, and static benchmarks cannot address those cases. We present an approach to dynamically evaluate the knowledge in LLMs and their time-sensitiveness against Wikidata, a publicly available up-to-date knowledge graph. We evaluate the time-sensitive knowledge in twenty-four private and open-source LLMs, as well as the effectiveness of four editing methods in updating the outdated facts. Our results show that 1) outdatedness is a critical problem across state-of-the-art LLMs; 2) LLMs output inconsistent answers when prompted with slight variations of the question prompt; and 3) the performance of the state-of-the-art knowledge editing algorithms is very limited, as they cannot reduce the cases of outdatedness and output inconsistency.
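
The core idea of checking an answer against current values and historical validity intervals can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Statement` class, the `classify` function, the substring matching rule, and the hand-written club history are all assumptions introduced here, and the years are illustrative rather than retrieved from Wikidata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Statement:
    """One value of a time-sensitive fact with its validity interval (years)."""
    value: str
    start: int
    end: Optional[int]  # None means the value is still valid today


def classify(answer: str, statements: List[Statement]) -> str:
    """Label a model answer as 'correct', 'outdated', or 'irrelevant'."""
    text = answer.casefold()
    current = [s for s in statements if s.end is None]
    past = [s for s in statements if s.end is not None]
    # Check the currently valid value first, so an answer that names it
    # is never downgraded because it also mentions an older value.
    if any(s.value.casefold() in text for s in current):
        return "correct"
    if any(s.value.casefold() in text for s in past):
        return "outdated"
    return "irrelevant"


# Illustrative, hand-written history (hypothetical data, not a Wikidata dump).
RONALDO_CLUBS = [
    Statement("Real Madrid", 2009, 2018),
    Statement("Juventus", 2018, 2021),
    Statement("Al Nassr", 2023, None),
]

for reply in ["Real Madrid", "He plays for Juventus.", "Lakers", "Al Nassr"]:
    print(f"{reply!r}: {classify(reply, RONALDO_CLUBS)}")
```

Because the reference statements are looked up at evaluation time rather than frozen into the benchmark, the same answer can move from "correct" to "outdated" once the underlying fact changes.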

Paper Structure (8 sections, 6 figures, 6 tables)

This paper contains 8 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: LLMs A, B, and C may respond with outdated (Real Madrid, Juventus) and irrelevant (Lakers) responses, respectively, to the user question: "What is Cristiano Ronaldo's club?". Wikidata contains up-to-date information to assess the models' accuracy and time-sensitiveness.
  • Figure 2: The level of prompt agreement for each model across three prompts for each time-sensitive question. Subscripts ${I.}$ and ${C.}$ stand for Instruct and Chat, respectively. Instruction-tuned models demonstrate a comparatively higher prompt agreement.
  • Figure 3: Approximating the temporal interval of the data used for (pre-)training LLMs following our evaluation regarding time-sensitive knowledge. The y-axis presents the evaluated LLMs with their release year in parentheses. The box plots present the distribution of the generated responses for each LLM according to their validity interval. For instance, the responses of OpenELM 1.1B range from 2006 to 2020, with a concentrated period between 2012 and 2016, suggesting that the model is trained on comparatively older datasets.
  • Figure 4: The scalability of editing algorithms for "updating" the outdated facts in GPT-2 and Llama-2$_{C.}$. The x-axis and y-axis represent the number of edits (in parentheses, the percentage of the total edits) and the harmonic mean of the models, respectively.
  • Figure 5: Approximating the temporal period of the data used for (pre-)training the models according to their correct and outdated outputs to our time-sensitive factual questions. The y-axis presents the evaluated LLMs with their release year in parentheses. The box plots present the distribution of the generated responses for each LLM according to their validity interval. Each box plot shows the interquartile range of the responses, with whiskers extending to the minimum and maximum dates.
  • ...and 1 more figures
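
The prompt agreement shown in Figure 2 can be approximated with a simple measure. The sketch below assumes one particular definition, the fraction of questions for which all prompt variants yield the same normalized answer; the paper's exact definition is not given in this section, and the function name `prompt_agreement` and the sample answers are hypothetical.

```python
from typing import List, Sequence


def prompt_agreement(answers_per_question: Sequence[Sequence[str]]) -> float:
    """Fraction of questions whose prompt variants all give the same answer.

    Each inner sequence holds the answers one model produced for a single
    question under the different prompt phrasings (three in Figure 2).
    """
    if not answers_per_question:
        return 0.0
    agreeing = 0
    for answers in answers_per_question:
        # Normalize case and surrounding whitespace before comparing.
        distinct = {a.strip().casefold() for a in answers}
        if len(distinct) == 1:
            agreeing += 1
    return agreeing / len(answers_per_question)


# Hypothetical outputs of one model for two questions, three prompts each.
sample: List[List[str]] = [
    ["Juventus", "juventus", "Juventus "],      # all three prompts agree
    ["Real Madrid", "Juventus", "Lakers"],      # prompts disagree
]
print(prompt_agreement(sample))
```

Exact string matching after normalization is a deliberately strict choice; a real evaluation would likely need alias handling so that different surface forms of the same entity count as agreement.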