PAT-Questions: A Self-Updating Benchmark for Present-Anchored Temporal Question-Answering

Jannat Ara Meem, Muhammad Shihab Rashid, Yue Dong, Vagelis Hristidis

TL;DR

This work targets Present-Anchored Temporal QA (PATQA), addressing questions whose temporal validity is relative to the present. It introduces PAT-Questions, a self-updating benchmark of 6172 present-time-sensitive QAs derived from TEMPREASON templates and anchored to Wikidata, with automatic answer updates via SPARQL queries. The authors evaluate multiple LLMs and TEMPREASON-T5 under direct prompting and RAG, revealing substantial gaps in present-anchored and multi-hop temporal reasoning, even with up-to-date retrieval. The dataset's automatic updating mechanism and two-timestamp design enable robust, ongoing evaluation of PATQA methods, highlighting the need for new reasoning and grounding approaches in evolving knowledge bases.

Abstract

Existing work on Temporal Question Answering (TQA) has predominantly focused on questions anchored to specific timestamps or events (e.g. "Who was the US president in 1970?"). Little work has studied questions whose temporal context is relative to the present time (e.g. "Who was the previous US president?"). We refer to this problem as Present-Anchored Temporal QA (PATQA). PATQA poses unique challenges: (1) large language models (LLMs) may have outdated knowledge, (2) complex temporal relationships (e.g. 'before', 'previous') are hard to reason about, (3) multi-hop reasoning may be required, and (4) the gold answers of benchmarks must be continuously updated. To address these challenges, we introduce the PAT-Questions benchmark, which includes single-hop and multi-hop temporal questions. The answers in PAT-Questions can be automatically refreshed by re-running SPARQL queries on a knowledge graph, if available. We evaluate several state-of-the-art LLMs and a SOTA temporal reasoning model (TEMPREASON-T5) on PAT-Questions through direct prompting and retrieval-augmented generation (RAG). The results highlight the limitations of existing solutions in PATQA and motivate the need for new methods to improve PATQA reasoning capabilities.
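
The answer-refresh idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the query template, the helper names (`build_query`, `current_and_previous`), and the sample records are assumptions made for illustration. The only Wikidata specifics relied on are the properties P39 (position held), P580 (start time), and P582 (end time). The sketch shows why a present-anchored gold answer changes when the same logic is re-run at a later reference date.

```python
from datetime import date
from string import Template
from typing import List, Optional, Tuple

# Illustrative SPARQL template (an assumption, not the paper's template).
# Wikidata: P39 = position held, P580 = start time, P582 = end time.
HOLDERS_QUERY = Template("""
SELECT ?holder ?start ?end WHERE {
  ?holder p:P39 ?stmt .
  ?stmt ps:P39 wd:$position ;
        pq:P580 ?start .
  OPTIONAL { ?stmt pq:P582 ?end . }
}
ORDER BY ?start
""")

# (holder, start, end); end is None while the term is ongoing.
Record = Tuple[str, date, Optional[date]]


def build_query(position_qid: str) -> str:
    """Fill the template with the QID of the position being asked about."""
    return HOLDERS_QUERY.substitute(position=position_qid)


def current_and_previous(
    records: List[Record], today: date
) -> Tuple[Optional[str], Optional[str]]:
    """Return (current, previous) holders as of `today`.

    Simplifying assumption: terms do not overlap and there is no vacancy,
    so the most recently started term is the current one.
    """
    started = sorted((r for r in records if r[1] <= today), key=lambda r: r[1])
    current = started[-1][0] if started else None
    previous = started[-2][0] if len(started) >= 2 else None
    return current, previous


# Made-up records standing in for query results.
records = [
    ("Holder A", date(2010, 1, 1), date(2018, 1, 1)),
    ("Holder B", date(2018, 1, 1), date(2022, 6, 1)),
    ("Holder C", date(2022, 6, 1), None),
]

print(current_and_previous(records, date(2021, 12, 1)))  # ('Holder B', 'Holder A')
print(current_and_previous(records, date(2023, 12, 1)))  # ('Holder C', 'Holder B')
```

The answer to "who is the previous holder?" moves from Holder A to Holder B between the two reference dates even though the question text is unchanged, which is the reason the gold answers need periodic refreshing.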

Paper Structure (17 sections, 11 figures, 12 tables, 1 algorithm)

This paper contains 17 sections, 11 figures, 12 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of the limitations of LLMs in answering present-anchored temporal questions. The LLMs respond with an out-of-date answer (purple) due to outdated knowledge, or with false information (red) due to a lack of multi-hop PAT reasoning abilities.
  • Figure 2: Illustration of PAT-Questions dataset construction following Algorithm 1. First, we modify the time-sensitive templates from the TEMPREASON dataset (Tan et al., 2023) to build PAT-Questions templates. Then, following the steps shown in the figure, we create a set of one-hop and multi-hop PAT-Questions with annotated answers for two different timestamps, Dec 2021 and Dec 2023. Here, $\tau$ and $\alpha$ refer to a year and an entity, respectively.
  • Figure 3: Illustration of automatic answer-updates to two multi-hop PAT-Questions via SPARQL templates
  • Figure 4: Error distribution of the incorrect LLM responses
  • Figure 5: Panels (a) and (b) show the Wikidata relation distributions over PAT-Questions
  • ...and 6 more figures