ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models

Yachuan Liu, Xiaochun Wei, Lin Shi, Xinnuo Li, Bohan Zhang, Paramveer Dhillon, Qiaozhu Mei

TL;DR

ExAnte defines and benchmarks ex-ante inference for LLMs: reasoning under a temporal cutoff without leaking knowledge of events after that cutoff. It introduces four datasets (Stock, QA, Wikipedia, Publication) and a leakage rate metric that quantifies how well a model adheres to pre-cutoff information, alongside a task-specific quality measure. Across multiple models and prompting strategies, temporal leakage persists; its extent varies by task and is strongly influenced by cutoff gaps and memorization. The work provides a structured framework and datasets for advancing temporal reasoning in time-sensitive applications, and it highlights the need for architectural and training innovations beyond prompting. The benchmark establishes a baseline and motivates future methods for more reliable ex-ante reasoning in LLMs.

Abstract

Large language models (LLMs) face significant challenges in ex-ante reasoning, where analysis, inference, or predictions must be made without access to information from future events. Even with explicit prompts enforcing temporal cutoffs, LLMs often generate outputs influenced by internalized knowledge of events beyond the specified cutoff. This paper introduces a novel task and benchmark designed to evaluate the ability of LLMs to reason while adhering to such temporal constraints. The benchmark includes a variety of tasks: stock prediction, Wikipedia event prediction, scientific publication prediction, and Question Answering (QA), designed to assess factual knowledge under temporal cutoff constraints. We use leakage rate to quantify models' reliance on future information beyond cutoff timestamps. Experimental results reveal that LLMs struggle to consistently adhere to temporal cutoffs across common prompting strategies and tasks, demonstrating persistent challenges in ex-ante reasoning. This benchmark provides a potential evaluation framework to advance the development of LLMs' temporal reasoning ability for time-sensitive applications.
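The abstract names leakage rate as the measure of reliance on post-cutoff information but does not define it here. As a rough illustration only, the sketch below assumes the rate is the fraction of model outputs that rely on at least one fact dated after the cutoff. The `Response` type, its `referenced_dates` annotation, and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass
class Response:
    """One model output, annotated for illustration (hypothetical schema)."""
    cutoff: date                  # temporal cutoff stated in the prompt
    referenced_dates: List[date]  # dates of the facts the output relies on


def is_leaked(response: Response) -> bool:
    # Assumption: an output counts as leaked if any fact it relies on
    # is dated strictly after the cutoff.
    return any(d > response.cutoff for d in response.referenced_dates)


def leakage_rate(responses: Sequence[Response]) -> float:
    # Assumed definition: share of outputs flagged as leaked.
    # Returns 0.0 for an empty collection to avoid division by zero.
    if not responses:
        return 0.0
    return sum(is_leaked(r) for r in responses) / len(responses)


# Usage: one output stays within the cutoff, the other does not.
cutoff = date(2021, 1, 1)
examples = [
    Response(cutoff, [date(2020, 6, 1)]),
    Response(cutoff, [date(2020, 6, 1), date(2022, 3, 15)]),
]
rate = leakage_rate(examples)  # 0.5 under the assumed definition
```

In practice, deciding which facts an output relies on is the hard part; this sketch takes that annotation as given.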

Paper Structure

This paper contains 59 sections, 6 equations, 2 figures, 13 tables.

Figures (2)

  • Figure 1: Illustration of temporal reasoning and benchmark task structure.
  • Figure 2: GPT-4o's historical stock price memorization pattern for AAPL. The blue line represents model-predicted prices while the red dashed line shows the ground truth historical prices. The plot demonstrates significantly improved memorization accuracy post-2021, forming a natural temporal boundary for our ExAnte analysis.