A Study into Investigating Temporal Robustness of LLMs

Jonas Wallat, Abdelrahman Abdallah, Adam Jatowt, Avishek Anand

TL;DR

This study investigates how robust large language models are to temporal reasoning tasks by proposing a suite of eight automatic time-related robustness tests and benchmarking six prominent models in zero-shot QA. The authors reveal consistent temporal brittleness, particularly to time relativization, removal, and reformulation, and they demonstrate that input-oriented reformulations can improve QA performance by up to about 55%. They also introduce an automatic robustness testing framework that uses test-question reformulations to gauge answer reliability on the fly, enabling trust calibration without ground-truth answers. The work highlights practical implications for deployment, suggesting prompts and question formulations that bolster temporal QA and outlining avenues for integrating retrieval and more robust temporal metrics in future work. Overall, the paper provides the first comprehensive temporal robustness benchmark for LLMs and offers actionable insights for improving temporal QA in real-world applications.

Abstract

Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often cannot understand temporal scope and orientation or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are for question answering based on their ability to process temporal information and perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find LLMs lacking temporal robustness, especially to temporal reformulations and the use of different granularities of temporal references. We show how a selection of these eight tests can be used automatically to judge a model's temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve the temporal QA performance by up to 55 percent.
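The idea of judging robustness for a user question on the fly, without a ground-truth answer, can be illustrated with a simple agreement check across reformulations of the same question. The sketch below is a minimal illustration and not the authors' implementation: the `ask` callable stands in for any LLM call, and `normalize`, `agreement_score`, and the scoring rule (share of answers matching the most common one) are assumptions made for this example.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def agreement_score(
    ask: Callable[[str], str], questions: Iterable[str]
) -> Tuple[str, float]:
    """Ask every reformulation and return (majority answer, agreement share).

    `ask` is a hypothetical stand-in for a model call. A low share suggests
    the answer changes with the temporal phrasing and deserves less trust.
    """
    answers = [normalize(ask(q)) for q in questions]
    if not answers:
        raise ValueError("at least one question is required")
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)


# Illustrative stub model with canned answers (not real model output).
_CANNED = {
    "Who was US president in 2010?": "Barack Obama",
    "Who held the office of US president during 2010?": "barack obama.",
    "Who was US president 14 years before 2024?": "George W. Bush",
}


def stub_model(question: str) -> str:
    return _CANNED[question]


majority, share = agreement_score(stub_model, _CANNED.keys())
```

In this toy run, two of the three phrasings agree, so the share falls below 1.0. Where to set a trust threshold is a design choice that this sketch does not settle.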

Paper Structure

This paper contains 53 sections, 3 figures, and 19 tables.

Figures (3)

  • Figure 1: We investigate the robustness of temporal understanding with a set of tests (here: temporal reversal). By asking the inverse question and looking for consistency between the two answers, we can study if the model understands the temporal-factual information.
  • Figure 2: Overview of the different tests in our robustness test suite for temporal factual QA. We suggest a suite of several tests that are useful in multiple applications: 1) helping to assess the temporal robustness of LLMs for temporal QA, 2) calibrating user trust at inference time, and 3) serving as guidelines on how to reformulate arbitrary (temporal) questions to improve QA performance.
  • Figure 3: Difference between ground-truth and predicted years for Llama 3.1.
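
The temporal reversal test described in the Figure 1 caption (ask the inverse question and check that the two answers are consistent) can be sketched as follows. This is a hedged illustration under stated assumptions rather than the paper's code: `ask` is a hypothetical model call, the question templates are invented for the example, and `extract_year`, `reversal_consistent`, and the `tolerance` parameter are names introduced here.

```python
import re
from typing import Callable, Optional


def extract_year(text: str) -> Optional[int]:
    """Return the first four-digit year (1000-2099) found in text, if any."""
    match = re.search(r"\b(1\d{3}|20\d{2})\b", text)
    return int(match.group(1)) if match else None


def reversal_consistent(
    ask: Callable[[str], str],
    forward_question: str,
    make_inverse_question: Callable[[str], str],
    year: int,
    tolerance: int = 0,
) -> bool:
    """Check that the inverse question recovers the year of the forward one.

    1. Ask the forward question (it mentions `year`) to get an entity.
    2. Build the inverse question from that entity and ask for the year.
    3. Count the pair as consistent if the years differ by <= tolerance.
    """
    entity = ask(forward_question).strip()
    inverse_answer = ask(make_inverse_question(entity))
    predicted_year = extract_year(inverse_answer)
    return predicted_year is not None and abs(predicted_year - year) <= tolerance


# Illustrative stub model with canned answers (not real model output).
def stub_model(question: str) -> str:
    canned = {
        "Who won the Nobel Prize in Physics in 1921?": "Albert Einstein",
        "In which year did Albert Einstein win the Nobel Prize in Physics?":
            "He won it in 1921.",
    }
    return canned[question]


consistent = reversal_consistent(
    stub_model,
    "Who won the Nobel Prize in Physics in 1921?",
    lambda entity: f"In which year did {entity} win the Nobel Prize in Physics?",
    year=1921,
)
```

The check needs no gold answer, only the year stated in the original question. It can still pass when both answers are wrong in a mutually consistent way, so it measures consistency rather than correctness.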