Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models

Soyeon Kim, Jindong Wang, Xing Xie, Steven Euijong Whang

TL;DR

TDBench is a new benchmark that systematically constructs TSQA pairs from temporal databases using database techniques such as temporal functional dependencies, temporal SQL, and temporal joins. It enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor.

Abstract

Facts change over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. Although factual Time-Sensitive Question-Answering (TSQA) tasks have been widely developed, existing benchmarks often face manual bottlenecks that limit scalable and comprehensive TSQA evaluation. To address this issue, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques, such as temporal functional dependencies, temporal SQL, and temporal joins. We also introduce a new evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy for a more fine-grained TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing current TSQA evaluation approaches that largely center on Wikipedia/Wikidata by enabling LLM evaluation on application-specific data.

Paper Structure

This paper contains 66 sections, 4 equations, 5 figures, 29 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of TDBench framework. TDBench systematically constructs Time-Sensitive QA (TSQA) pairs by (1) selecting factual knowledge via temporal functional dependencies, (2) generating temporal SQL queries with diverse temporal contexts, and (3) converting queries into natural language QA pairs using an LLM and the database. During evaluation, TDBench automatically verifies both the final answer and time references in LLM responses, capturing cases where the model hallucinates in the explanation despite providing the correct answer. TDBench supports diverse TSQA scenarios, including temporal alignment, temporal reasoning, and implicit multi-hop questions.
  • Figure 2: LLM performances on 13 temporal relations in the open-book setting. The heatmap displays $\mathbf{AT}$ for each temporal relation, as defined in Table \ref{tbl:relation_full}.
  • Figure 3: LLM performances across different time spans (1985-2025) in a single domain (Countries) under the closed-book setting. Additional results aggregated over multiple domains are presented in Sec. \ref{supp:exp_temp_span_agg}.
  • Figure 4: Open-book vs. Closed-book QA in TDBench. The open-book setting provides table rows as additional context, while the closed-book setting provides only the question.
  • Figure 5: LLM performances across different time spans (1985-2025) in the closed-book setting, evaluated on multiple domains: Country, Athletes, and Organizations.
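
The pipeline in the Figure 1 caption (select a fact, query it at a point in time, turn the result into a QA pair, then check both the answer and the time references) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `leader` table, its rows, the `make_qa` and `grade` helpers, the fixed question template (the paper uses an LLM for this step), and the simplified rule that a time reference is valid when it falls inside the fact's validity interval. It is not the authors' implementation.

```python
import sqlite3

# Hypothetical temporal table: each row is valid over the half-open
# year interval [start_year, end_year). The data is invented.
ROWS = [
    ("Freedonia", "Alice Arden", 2001, 2009),
    ("Freedonia", "Bob Birch", 2009, 2017),
    ("Sylvania", "Cara Cole", 2005, 2013),
]


def build_db():
    """Create an in-memory database holding the toy temporal table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leader ("
        "country TEXT, name TEXT, start_year INTEGER, end_year INTEGER)"
    )
    conn.executemany("INSERT INTO leader VALUES (?, ?, ?, ?)", ROWS)
    return conn


def make_qa(conn, country, year):
    """Run a snapshot query at `year` and wrap the result as a QA pair."""
    row = conn.execute(
        "SELECT name, start_year, end_year FROM leader "
        "WHERE country = ? AND start_year <= ? AND ? < end_year",
        (country, year, year),
    ).fetchone()
    if row is None:
        return None  # no fact is valid at that time
    name, start, end = row
    # A fixed template stands in for the LLM-based conversion step.
    question = f"Who was the leader of {country} in {year}?"
    return {"question": question, "answer": name, "valid": (start, end)}


def grade(qa, model_answer, cited_years):
    """Return (answer_ok, time_ok) for one model response.

    Assumption: time_ok holds when the explanation cites at least one
    year and every cited year lies inside the fact's validity interval.
    """
    answer_ok = model_answer.strip().lower() == qa["answer"].lower()
    start, end = qa["valid"]
    time_ok = bool(cited_years) and all(start <= y < end for y in cited_years)
    return answer_ok, time_ok
```

For example, `grade(make_qa(build_db(), "Freedonia", 2010), "Bob Birch", [2005])` reports a correct answer with an invalid time reference, which is the kind of case the caption describes as hallucination in the explanation despite a correct final answer.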

Theorems & Definitions (2)

  • Definition A.1: Temporal functional dependencies
  • Definition A.2: Temporal natural join
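
The two definitions listed above are not reproduced on this page, so the sketch below follows the standard temporal-database reading and should be treated as an assumption: a temporal functional dependency X → Y holds when the ordinary dependency holds in every snapshot, and a temporal natural join matches tuples on their shared attributes and keeps the intersection of their validity intervals. The helper names `holds_tfd` and `temporal_join`, the dict-based row layout, and the half-open `start`/`end` intervals are hypothetical choices.

```python
from collections import defaultdict


def overlaps(r, s):
    """True if the half-open intervals [start, end) of r and s intersect."""
    return r["start"] < s["end"] and s["start"] < r["end"]


def holds_tfd(rows, x, y):
    """Check X -> Y in every snapshot of `rows`.

    A violation is a pair of tuples that agree on X, differ on Y,
    and are valid at some common time point.
    """
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[a] for a in x)].append(r)
    for group in groups.values():
        for i, r in enumerate(group):
            for s in group[i + 1:]:
                if overlaps(r, s) and any(r[a] != s[a] for a in y):
                    return False
    return True


def temporal_join(left, right, on):
    """Join on the attributes in `on`, intersecting validity intervals."""
    out = []
    for r in left:
        for s in right:
            if all(r[a] == s[a] for a in on) and overlaps(r, s):
                joined = {**r, **s}
                joined["start"] = max(r["start"], s["start"])
                joined["end"] = min(r["end"], s["end"])
                out.append(joined)
    return out
```

Under this reading, a relation in which a country has two different leaders during overlapping years violates `country → leader`, while the same two leaders in consecutive, non-overlapping terms satisfy it. Joining a leader table with a host-city table on `country` yields rows whose interval covers only the years in which both facts held, which is how a join can back an implicit multi-hop question.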