RASTeR: Robust, Agentic, and Structured Temporal Reasoning

Dan Schumacher; Fatemeh Haji; Tara Grey; Niharika Bandlamudi; Nupoor Karnik; Gagana Uday Kumar; Jason Cho-Yu Chiang; Paul Rad; Nishant Vishwamitra; Anthony Rios

TL;DR

The paper tackles temporal question answering under noisy or outdated retrieved context by introducing RASTeR, an agentic prompting framework that separates context evaluation from answer generation and converts reliable context into a Temporal Knowledge Graph (TKG) to support structured, time-aware reasoning. It formalizes a multi-agent pipeline including relevance assessment, iterative TKG construction, temporal correction, and TKG-based reasoning, with a fallback to parametric knowledge when context is unusable. Across four temporal QA datasets and multiple LLMs, RASTeR shows robust improvements in accuracy and F1, especially in adversarial and distractor-heavy settings, and maintains competitive performance when context is clean. The work highlights the value of explicit temporal grounding and modular reasoning for real-world TQA under imperfect retrieval conditions, and provides a detailed ablation and robustness analysis to illuminate strengths and limitations.
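The control flow described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Agents` container, the `answer` function, the edge tuple layout, and the choice to fall back when the corrected graph is empty are all assumptions made for the example, and each agent is a plain callable standing in for an LLM prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical TKG edge layout: (subject, relation, object, start_year, end_year).
Edge = Tuple[str, str, str, Optional[int], Optional[int]]


@dataclass
class Agents:
    """Hypothetical bundle of agents; each would wrap an LLM prompt in practice."""
    is_relevant: Callable[[str, str], bool]
    build_tkg: Callable[[str, str], List[Edge]]
    correct_tkg: Callable[[str, List[Edge]], List[Edge]]
    reason_over_tkg: Callable[[str, List[Edge]], str]
    zero_shot: Callable[[str], str]


def answer(question: str, context: Optional[str], agents: Agents) -> str:
    # Context evaluation happens before any answer generation.
    if not context or not agents.is_relevant(question, context):
        return agents.zero_shot(question)  # fall back to parametric knowledge
    tkg = agents.build_tkg(question, context)   # structured representation
    tkg = agents.correct_tkg(question, tkg)     # temporal correction step
    if not tkg:
        # Assumption: an empty graph is treated as unusable context.
        return agents.zero_shot(question)
    return agents.reason_over_tkg(question, tkg)


# Usage with toy stand-ins for the LLM-backed agents.
stub = Agents(
    is_relevant=lambda q, c: "CEO" in c,
    build_tkg=lambda q, c: [("Acme", "ceo", "Ada", 2020, None)],
    correct_tkg=lambda q, g: g,
    reason_over_tkg=lambda q, g: g[-1][2],
    zero_shot=lambda q: "parametric answer",
)
print(answer("Who is the CEO of Acme?", "Ada became CEO of Acme in 2020.", stub))
```

Keeping the relevance check outside the reasoning agent means the fallback decision can be inspected and ablated independently of the answer generator.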

Abstract

Temporal question answering (TQA) remains a challenge for large language models (LLMs), particularly when retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications like clinical event ordering and policy tracking, which require reliable temporal reasoning even under noisy or outdated information. To address this challenge, we introduce RASTeR (Robust, Agentic, and Structured Temporal Reasoning), a prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of the retrieved context, then constructs a temporal knowledge graph (TKG) to better facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustness, defined here as the ability to answer correctly despite suboptimal context (some TQA work instead defines robustness as handling diverse temporal phenomena). We further validate our approach through a "needle-in-the-haystack" study, in which relevant context is buried among distractors. With forty distractors, RASTeR achieves 75% accuracy, over 12% ahead of the runner-up.
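To make the "needle-in-the-haystack" setup concrete, the following sketch shows one way such an input could be assembled: a single relevant passage placed at a random position among sampled distractors. The function name `build_haystack`, the sampling scheme, and the seeding are assumptions for illustration; the excerpt does not specify how the authors select or order distractors.

```python
import random
from typing import List, Sequence


def build_haystack(
    relevant: str,
    distractor_pool: Sequence[str],
    n_distractors: int,
    seed: int = 0,
) -> List[str]:
    """Return n_distractors irrelevant passages with one relevant passage mixed in."""
    if n_distractors > len(distractor_pool):
        raise ValueError("not enough distractors in the pool")
    rng = random.Random(seed)  # seeded for reproducible placement
    passages = rng.sample(list(distractor_pool), n_distractors)
    # Insert the relevant passage at a uniformly random position.
    passages.insert(rng.randrange(n_distractors + 1), relevant)
    return passages


# Usage: forty distractors around one relevant passage, as in the study above.
pool = [f"Unrelated passage {i}." for i in range(100)]
haystack = build_haystack("Ada became CEO of Acme in 2020.", pool, 40)
context = "\n".join(haystack)
```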

Paper Structure (18 sections, 8 equations, 13 figures, 15 tables)

Introduction
Related Work
Method
Experiments
Results
Conclusion
Appendix
Metric Formalization
Descriptive Statistics
Estimated Token Costs
Expanded Results
Significance Testing
Using Semantically Similar Context
Error Analysis
Ordering Dates
...and 3 more sections

Figures (13)

Figure 1: Example of TQA failure due to irrelevant context. The retrieved statement is outdated, leading to an incorrect answer. RASTeR detects the inconsistency and defaults to parametric knowledge. Additional experiments examine other context imperfections (e.g. partially incorrect, fully irrelevant context).
Figure 2: Overview of the RASTeR framework. Given a question and retrieved context, the system first determines whether the context is relevant and temporally coherent. If necessary, it corrects temporal inconsistencies before generating a structured TKG. The final answer is produced either by reasoning over the TKG or, in cases of irrelevant or missing context, via a fallback zero-shot reasoner.
Figure 3: GPT accuracy as the number of distractors (irrelevant contexts) increases around a single relevant passage. All contexts have a relevant passage.
Figure 4: An example where the model output is semantically correct but fails EM and Acc.
Figure 5: System and user determining the relevance of a provided context.
...and 8 more figures
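The situation in Figure 4, where a semantically correct output still fails exact match (EM), can be illustrated with a commonly used string-normalized EM check. This is a generic sketch under assumptions: the normalization steps (lowercasing, stripping punctuation and English articles) follow a widespread QA convention and are not confirmed as the paper's metric, and the names `normalize` and `exact_match` are hypothetical. The paper's Acc metric is not defined in this excerpt and is not modeled here.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    # EM compares normalized strings, so paraphrases of a correct answer fail.
    return normalize(prediction) == normalize(gold)


# A paraphrased but arguably correct answer does not match the gold string.
print(exact_match("President Obama", "Barack Obama"))
```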
