EvoWiki: Evaluating LLMs on Evolving Knowledge

Wei Tang, Yixin Cao, Yang Deng, Jiahao Ying, Bo Wang, Yizhe Yang, Yuyue Zhao, Qi Zhang, Xuanjing Huang, Yugang Jiang, Yong Liao

TL;DR

EvoWiki tackles the gap in evaluating LLM knowledge utilization under evolving information by introducing a continually auto-updatable benchmark that classifies facts into stable, evolved, and uncharted states and anchors them to Wikidata/Wikipedia via distant supervision. It enables contamination-free, time-aware assessment of how models leverage external knowledge, both with and without retrieval augmentation and continual learning. The study finds that current RAG and CL approaches struggle with evolving knowledge, though combining them yields synergistic gains, especially for multi-hop tasks. EvoWiki provides a practical benchmark for advancing knowledge-evolution research in LLMs and informing deployment in dynamic real-world contexts.

Abstract

Knowledge utilization is a critical aspect of LLMs, and understanding how they adapt to evolving knowledge is essential for their effective deployment. However, existing benchmarks are predominantly static, failing to capture the evolving nature of LLMs and knowledge, leading to inaccuracies and vulnerabilities such as contamination. In this paper, we introduce EvoWiki, an evolving dataset designed to reflect knowledge evolution by categorizing information into stable, evolved, and uncharted states. EvoWiki is fully auto-updatable, enabling precise evaluation of continuously changing knowledge and newly released LLMs. Through experiments with Retrieval-Augmented Generation (RAG) and Continual Learning (CL), we evaluate how effectively LLMs adapt to evolving knowledge. Our results indicate that current models often struggle with evolved knowledge, frequently providing outdated or incorrect responses. Moreover, the dataset highlights a synergistic effect between RAG and CL, demonstrating their potential to better adapt to evolving knowledge. EvoWiki provides a robust benchmark for advancing future research on the knowledge evolution capabilities of large language models.

Paper Structure

This paper contains 29 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: EvoWiki categorizes knowledge into three states according to the cut-off date of the LLMs.
  • Figure 2: Evolution level identification process.
  • Figure 3: RAG performance across top-k values of Contriever; the dashed line represents closed-book QA results.
  • Figure 4: Probability shift (%) of CL methods on Llama for the first token of the golden answer.
  • Figure 5: Popularity effects of SFT on Llama. Due to data scarcity, we aggregated the popularity levels of 0 and 1 into a single category, as well as levels 5 and 6.
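
The three-state categorization named in the abstract and in the Figure 1 caption can be illustrated with a minimal sketch. The definitions used here are assumptions inferred from the state names and the reference to the model's cut-off date: a fact is treated as uncharted if it first appeared after the cut-off, evolved if it existed before the cut-off but its value changed afterwards, and stable otherwise. The `Fact` record, its field names, and the `classify` function are hypothetical and are not EvoWiki's actual schema or pipeline.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class KnowledgeState(Enum):
    STABLE = "stable"
    EVOLVED = "evolved"
    UNCHARTED = "uncharted"


@dataclass(frozen=True)
class Fact:
    # Hypothetical record; field names are illustrative assumptions.
    subject: str
    relation: str
    first_seen: date                     # when the fact first appeared in the source
    last_changed: Optional[date] = None  # latest change of its value, None if never changed


def classify(fact: Fact, cutoff: date) -> KnowledgeState:
    """Assign a knowledge state relative to a model's cut-off date (assumed rules)."""
    # The fact did not exist before the cut-off: the model cannot have seen it.
    if fact.first_seen > cutoff:
        return KnowledgeState.UNCHARTED
    # The fact existed before the cut-off, but its value changed afterwards.
    if fact.last_changed is not None and fact.last_changed > cutoff:
        return KnowledgeState.EVOLVED
    # The fact existed before the cut-off and has not changed since.
    return KnowledgeState.STABLE
```

Because the state depends on the cut-off date, the same fact can fall into different states for different models, which is consistent with the benchmark being re-evaluated for newly released LLMs.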