A Study into Investigating Temporal Robustness of LLMs
Jonas Wallat, Abdelrahman Abdallah, Adam Jatowt, Avishek Anand
TL;DR
This study investigates how robust large language models (LLMs) are on temporal question answering by proposing a suite of eight automatic time-related robustness tests and benchmarking six popular models in zero-shot QA. The authors find consistent temporal brittleness, particularly under time relativization, removal of temporal references, and temporal reformulation, and they show that reformulating the input question can improve QA performance by up to 55%. They also introduce an automatic robustness testing framework that reformulates a user's question to gauge answer reliability on the fly, which supports trust calibration without ground-truth answers. The work has practical implications for deployment: it suggests prompts and question formulations that strengthen temporal QA, and it outlines future work on integrating retrieval and on more robust temporal metrics. Overall, the paper provides a temporal robustness benchmark for LLMs and offers actionable insights for improving temporal QA in real-world applications.
Abstract
Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often fail to understand temporal scope and orientation, or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are in question answering with respect to their ability to process temporal information and to perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find that LLMs lack temporal robustness, especially to temporal reformulations and to the use of different granularities of temporal references. We show how a selection of these eight tests can be used automatically to judge a model's temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve temporal QA performance by up to 55 percent.
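
The idea of judging robustness on the fly, without ground-truth answers, can be illustrated with a small sketch. It is not the authors' implementation: the function names (`normalize`, `agreement_score`, `robustness_check`), the majority-agreement score, and the toy reformulation and stub model are all assumptions made for illustration. The sketch asks a model the original question plus several temporally reformulated variants and reports how often the answers agree.

```python
from collections import Counter
from typing import Callable, Iterable, List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def agreement_score(answers: Iterable[str]) -> float:
    """Fraction of answers that equal the most common normalized answer."""
    normalized = [normalize(a) for a in answers]
    if not normalized:
        raise ValueError("need at least one answer")
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


def robustness_check(
    question: str,
    reformulations: List[Callable[[str], str]],
    ask_model: Callable[[str], str],
) -> float:
    """Ask the original question and each reformulated variant, then score agreement.

    `reformulations` and `ask_model` are hypothetical hooks: the first rewrites
    the temporal part of a question, the second wraps whatever LLM is in use.
    """
    variants = [question] + [rewrite(question) for rewrite in reformulations]
    return agreement_score(ask_model(v) for v in variants)


if __name__ == "__main__":
    # Toy stand-ins, assumed for illustration only.
    def drop_year(q: str) -> str:
        return q.replace(" in 2010", "")

    def stub_model(q: str) -> str:
        # Pretend the model answers differently once the year is removed.
        return "A" if "2010" in q else "B"

    score = robustness_check("Who led the team in 2010?", [drop_year], stub_model)
    print(f"agreement across variants: {score:.2f}")
```

A low score suggests that the answer depends on how the temporal reference is phrased, so it may deserve less trust. Exact string matching after normalization is a deliberately simple choice here; a real setup would likely need a more tolerant answer comparison.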
