Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Xin Gao, Ruiyi Zhang, Daniel Du, Saurabh Mahindre, Sai Ashish Somayajula, Pengtao Xie

TL;DR

This work addresses data contamination in LLM-based temporal prediction by testing whether a prompted knowledge cutoff can make a model behave as if its knowledge ended at an earlier date. It builds three evaluation subsets (Factual, Semantic, and Counterfactual) to probe forgetting of direct facts, semantic shifts, and causally related content, respectively, using two meta-prompts across three LLMs. Prompted cutoffs are largely effective for direct factual forgetting (~$82.5\%$) and semantic forgetting (~$70.0\%$) but rarely erase causally linked content (~$19.2\%$), which underscores how hard prompting-based unlearning is for real-world temporal tasks. The study highlights the need for more robust evaluation methods and unlearning techniques, and provides datasets and code to support future research on fair and trustworthy temporal prediction with LLMs.

Abstract

Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate whether prompting can simulate an earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs are effective when the model is queried directly about information after that date, they struggle to induce forgetting when the forgotten content is not directly asked about but is causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs to temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.
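
To make the evaluation setup concrete, the following is a minimal sketch of how a simulated cutoff could be injected through a system prompt and how a per-subset success rate could be tallied. It is an illustration under stated assumptions, not the authors' code: the prompt wording, the `Judged` record, and the definition of the rate as the share of responses that respect the cutoff are all hypothetical, and the toy judgments are made up for the example rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


def build_cutoff_prompt(cutoff_year: int) -> str:
    """Hypothetical system prompt; NOT the paper's actual meta-prompt wording."""
    return (
        f"Assume your knowledge ends on December 31, {cutoff_year}. "
        "Answer as if you know nothing about later events."
    )


@dataclass
class Judged:
    """One judged model response (hypothetical record layout)."""
    subset: str   # "Factual", "Semantic", or "Counterfactual"
    forgot: bool  # True if the response respected the simulated cutoff


def unlearn_success_rate(results):
    """Assumed metric: per-subset share of responses that respected the cutoff."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.subset] += 1
        hits[r.subset] += int(r.forgot)
    return {subset: hits[subset] / totals[subset] for subset in totals}


# Toy judgments, invented for illustration only.
toy_results = [
    Judged("Factual", True),
    Judged("Factual", True),
    Judged("Factual", True),
    Judged("Factual", False),
    Judged("Counterfactual", True),
    Judged("Counterfactual", False),
    Judged("Counterfactual", False),
    Judged("Counterfactual", False),
]

system_prompt = build_cutoff_prompt(2020)
rates = unlearn_success_rate(toy_results)
print(system_prompt)
print(rates)
```

In practice the `forgot` flag would come from comparing each response against a reference answer that is valid at the simulated cutoff; how that judgment is made is left open here.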

Paper Structure

This paper contains 18 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Top: The LLM answers the user’s question using memorized knowledge. Bottom: The LLM does not use memorized knowledge to respond, given the prompted knowledge cutoff.
  • Figure 2: Distribution of data instances by year across the Factual, Semantic, and Counterfactual subsets.
  • Figure 3: Example of data in (a) Factual, (b) Semantic, and (c) Counterfactual subsets. Incorrect LLM responses use the real knowledge cutoff, while correct responses consider the simulated knowledge cutoff in the system prompt.
  • Figure 4: Unlearn success rate of three LLMs (DeepSeek-V3, LLaMA-3.1-405B, and GPT-4o) on three of our subsets (Factual, Semantic, and Counterfactual) using two different prompts (P1 and P2).
  • Figure 5: Distribution of three subsets by data category.
  • ...and 3 more figures