PAT-Questions: A Self-Updating Benchmark for Present-Anchored Temporal Question-Answering
Jannat Ara Meem, Muhammad Shihab Rashid, Yue Dong, Vagelis Hristidis
TL;DR
This work targets Present-Anchored Temporal QA (PATQA), addressing questions whose temporal validity is relative to the present. It introduces PAT-Questions, a self-updating benchmark of 6172 present-time-sensitive QAs derived from TEMPREASON templates and anchored to Wikidata, with automatic answer updates via SPARQL queries. The authors evaluate multiple LLMs and TEMPREASON-T5 under direct prompting and RAG, revealing substantial gaps in present-anchored and multi-hop temporal reasoning, even with up-to-date retrieval. The dataset's automatic updating mechanism and two-timestamp design enable robust, ongoing evaluation of PATQA methods, highlighting the need for new reasoning and grounding approaches in evolving knowledge bases.
Abstract
Existing work on Temporal Question Answering (TQA) has predominantly focused on questions anchored to specific timestamps or events (e.g. "Who was the US president in 1970?"). Little work has studied questions whose temporal context is relative to the present time (e.g. "Who was the previous US president?"). We refer to this problem as Present-Anchored Temporal QA (PATQA). PATQA poses unique challenges: (1) large language models (LLMs) may have outdated knowledge, (2) complex temporal relationships (e.g. 'before', 'previous') are hard to reason about, (3) multi-hop reasoning may be required, and (4) the gold answers of benchmarks must be continuously updated. To address these challenges, we introduce the PAT-Questions benchmark, which includes single-hop and multi-hop temporal questions. The answers in PAT-Questions can be automatically refreshed by re-running SPARQL queries on a knowledge graph, when one is available. We evaluate several state-of-the-art LLMs and a SOTA temporal reasoning model (TEMPREASON-T5) on PAT-Questions through direct prompting and retrieval-augmented generation (RAG). The results highlight the limitations of existing solutions in PATQA and motivate the need for new methods to improve PATQA reasoning capabilities.
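To illustrate why present-anchored gold answers must be recomputed over time, here is a minimal sketch. It is not the benchmark's actual pipeline: the paper refreshes answers by re-running SPARQL queries on a knowledge graph, whereas this snippet assumes a small in-memory list of time-scoped facts. The entities ("Team X", "Coach A", ...), the `Fact` type, and the helpers `current_answer` and `previous_answer` are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """A time-scoped fact (hypothetical schema, not the benchmark's)."""
    subject: str
    relation: str
    obj: str
    start: date
    end: Optional[date]  # None means the fact is still valid


def current_answer(facts: List[Fact], subject: str, relation: str,
                   today: date) -> Optional[str]:
    """Return the object whose validity interval contains `today`."""
    for f in facts:
        if (f.subject == subject and f.relation == relation
                and f.start <= today and (f.end is None or today < f.end)):
            return f.obj
    return None


def previous_answer(facts: List[Fact], subject: str, relation: str,
                    today: date) -> Optional[str]:
    """Return the object that held the relation just before the current one."""
    started = sorted(
        (f for f in facts
         if f.subject == subject and f.relation == relation
         and f.start <= today),
        key=lambda f: f.start,
    )
    # The last entry is the current holder; the one before it is "previous".
    return started[-2].obj if len(started) >= 2 else None


# Toy data with made-up entities and dates (assumption, not from the paper).
FACTS = [
    Fact("Team X", "head coach", "Coach A", date(2015, 1, 1), date(2019, 1, 1)),
    Fact("Team X", "head coach", "Coach B", date(2019, 1, 1), date(2023, 1, 1)),
    Fact("Team X", "head coach", "Coach C", date(2023, 1, 1), None),
]
```

Evaluating the same question at two reference dates, for example `date(2021, 6, 1)` and `date(2024, 6, 1)`, gives different answers to both the "current" and the "previous" question. That is the reason a static gold answer goes stale and a re-runnable query is needed.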
