RASTeR: Robust, Agentic, and Structured Temporal Reasoning
Dan Schumacher, Fatemeh Haji, Tara Grey, Niharika Bandlamudi, Nupoor Karnik, Gagana Uday Kumar, Jason Cho-Yu Chiang, Paul Rad, Nishant Vishwamitra, Anthony Rios
TL;DR
The paper tackles temporal question answering under noisy or outdated retrieved context by introducing RASTeR, an agentic prompting framework that separates context evaluation from answer generation and converts reliable context into a Temporal Knowledge Graph (TKG) to support structured, time-aware reasoning. It formalizes a multi-agent pipeline comprising relevance assessment, iterative TKG construction, temporal correction, and TKG-based reasoning, with a fallback to parametric knowledge when the context is unusable. Across four temporal QA datasets and multiple LLMs, RASTeR improves accuracy and F1, especially in adversarial and distractor-heavy settings, and remains competitive when the context is clean. The work highlights the value of explicit temporal grounding and modular reasoning for real-world TQA under imperfect retrieval, and provides ablation and robustness analyses that illuminate its strengths and limitations.
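The control flow summarized above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `TemporalFact` type, the `run_pipeline` function, and every agent callable passed into it are hypothetical names, and the toy agents at the bottom stand in for what would be LLM prompts in the actual framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class TemporalFact:
    """One TKG edge: (subject, relation, obj) valid over [start, end] in years."""
    subject: str
    relation: str
    obj: str
    start: Optional[int] = None  # None = unknown start
    end: Optional[int] = None    # None = unknown or open end


def run_pipeline(
    question: str,
    passages: List[str],
    is_relevant: Callable[[str, str], bool],
    extract_facts: Callable[[str], List[TemporalFact]],
    correct_fact: Callable[[TemporalFact], Optional[TemporalFact]],
    reason_over_tkg: Callable[[str, List[TemporalFact]], str],
    parametric_answer: Callable[[str], str],
) -> str:
    # Stage 1: context evaluation, kept separate from answer generation.
    kept = [p for p in passages if is_relevant(question, p)]

    # Stages 2-3: grow the TKG passage by passage, correcting or dropping facts.
    tkg: List[TemporalFact] = []
    for passage in kept:
        for fact in extract_facts(passage):
            fixed = correct_fact(fact)
            if fixed is not None and fixed not in tkg:
                tkg.append(fixed)

    # Fallback: no usable context, so answer from the model's own knowledge.
    if not tkg:
        return parametric_answer(question)

    # Stage 4: answer from the structured graph rather than the raw text.
    return reason_over_tkg(question, tkg)


# --- Toy agents (assumptions for illustration; real agents would be LLM calls) ---
def toy_is_relevant(question: str, passage: str) -> bool:
    return "CEO" in passage


def toy_extract(passage: str) -> List[TemporalFact]:
    # Expects "subject|relation|object|start|end".
    s, r, o, start, end = passage.split("|")
    return [TemporalFact(s, r, o, int(start), int(end))]


def toy_correct(fact: TemporalFact) -> Optional[TemporalFact]:
    # Discard facts whose interval is inverted; keep everything else unchanged.
    if fact.start is not None and fact.end is not None and fact.start > fact.end:
        return None
    return fact


def toy_reason(question: str, tkg: List[TemporalFact]) -> str:
    # Reads the year from the last token of the question, e.g. "... in 2012?".
    year = int(question.rsplit(" ", 1)[-1].rstrip("?"))
    for f in tkg:
        if f.start <= year <= f.end:
            return f.subject
    return "unknown"


def toy_parametric(question: str) -> str:
    return "parametric guess"


def ask(question: str, passages: List[str]) -> str:
    return run_pipeline(question, passages, toy_is_relevant, toy_extract,
                        toy_correct, toy_reason, toy_parametric)
```

Passing the agents in as callables mirrors the modular design described above: each stage can be swapped or ablated without touching the others, and the fallback path is taken only when no fact survives filtering and correction.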
Abstract
Temporal question answering (TQA) remains a challenge for large language models (LLMs), particularly when retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications such as clinical event ordering and policy tracking, which require reliable temporal reasoning even under noisy or outdated information. To address this challenge, we introduce RASTeR: \textbf{R}obust, \textbf{A}gentic, and \textbf{S}tructured \textbf{Te}mporal \textbf{R}easoning, a prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of the retrieved context, then constructs a temporal knowledge graph (TKG) to better facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustness\footnote{Some TQA work defines robustness as handling diverse temporal phenomena. Here, we define it as the ability to answer correctly despite suboptimal context.}. We further validate our approach through a ``needle-in-the-haystack'' study, in which relevant context is buried among distractors. With forty distractors, RASTeR achieves 75\% accuracy, over 12\% ahead of the runner-up.
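For readers who want to reproduce the flavor of the needle-in-the-haystack setting, the following is a minimal sketch of burying one relevant passage among distractors. The function name `bury_needle`, the uniform random placement, and the seed parameter are assumptions made here for illustration; the abstract does not specify how the study positions the relevant context.

```python
import random
from typing import List


def bury_needle(needle: str, distractors: List[str], seed: int = 0) -> List[str]:
    """Return a new context list with `needle` inserted at a random position."""
    rng = random.Random(seed)                   # seeded for reproducibility
    passages = list(distractors)                # copy so the input is not mutated
    position = rng.randint(0, len(passages))    # randint is inclusive on both ends
    passages.insert(position, needle)
    return passages


# Forty distractors plus one relevant passage, as in the setting described above.
distractors = [f"Unrelated passage {i}." for i in range(40)]
context = bury_needle("Ann was CEO of Acme from 2010 to 2015.", distractors)
```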
