Temporal Blind Spots in Large Language Models

Jonas Wallat, Adam Jatowt, Avishek Anand

TL;DR

This work evaluates temporal understanding in instruction-tuned LLMs using three temporal QA benchmarks (TemporalQuestions, ArchivalQA, TempLAMA) to reveal temporal blind spots. It compares multiple models, analyzes absolute vs relative time references, and examines the effects of corrupting or removing time cues. Key findings show limited past knowledge, a bias toward recent information up to a point, strong difficulty with relative time references, and several types of temporal errors including shifts, inertia, and referencing mistakes. The study highlights the need for temporally grounded training data and evaluation methodologies to improve LLMs' performance on temporally-oriented tasks and suggests directions for future model development.

Abstract

Large language models (LLMs) have recently gained significant attention due to their unparalleled ability to perform various natural language processing tasks. These models, benefiting from their advanced natural language understanding capabilities, have demonstrated impressive zero-shot performance. However, the pre-training data utilized in LLMs is often confined to a specific corpus, resulting in inherent freshness and temporal scope limitations. Consequently, this raises concerns regarding the effectiveness of LLMs for tasks involving temporal intents. In this study, we aim to investigate the underlying limitations of general-purpose LLMs when deployed for tasks that require a temporal understanding. We pay particular attention to handling factual temporal knowledge through three popular temporal QA datasets. Specifically, we observe low performance on detailed questions about the past and, surprisingly, for rather new information. In manual and automatic testing, we find multiple temporal errors and characterize the conditions under which QA performance deteriorates. Our analysis contributes to understanding LLM limitations and offers valuable insights into developing future models that can better cater to the demands of temporally-oriented tasks. The code is available at https://github.com/jwallat/temporalblindspots.

Paper Structure (42 sections, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Performance of the alpaca-7B, red-pajama-7B, and text-davinci-003 models on the ArchivalQA and TempLAMA datasets, stratified by year. The trendline is the moving average with a window of 2. We do not show plots for the TemporalQuestions dataset since it is not large enough to compute individual results per year.
  • Figure 2: Relative and absolute time referencing.
  • Figure 3: Effect of randomized relative and absolute time references. Textured bars show the randomized variants.
  • Figure 4: Manual analysis of 100 wrong answers each from alpaca-7B and text-davinci-003 on the ArchivalQA dataset.
  • Figure 5: The effect of corrupted time references on the alpaca-7B model. The relative time reference is calculated from a reference year (2021). Off-by-X means that the year is X years away from the correct year. No time refers to removing the time reference from the question entirely.
  • ...and 4 more figures
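
The captions above mention three simple operations: turning an absolute year into a relative reference against the reference year 2021, shifting a year by X ("Off-by-X"), and dropping the time reference ("No time"), plus a moving average with a window of 2 for the trendline. The following is a minimal sketch of what these could look like, not the authors' implementation. All function names, the "N years ago" wording, the "in <year>" question pattern, and the trailing (rather than centered) averaging window are assumptions made for illustration.

```python
import re

REFERENCE_YEAR = 2021  # reference year stated in the Figure 5 caption


def to_relative(year: int, reference_year: int = REFERENCE_YEAR) -> str:
    """Phrase an absolute year as a distance from the reference year.

    The exact wording is an assumption; the source only says the relative
    reference is calculated from a reference year.
    """
    delta = reference_year - year
    if delta == 0:
        return "this year"
    unit = "year" if abs(delta) == 1 else "years"
    return f"{delta} {unit} ago" if delta > 0 else f"in {-delta} {unit}"


def corrupt_year(question: str, year: int, offset: int) -> str:
    """Off-by-X corruption: shift the year mentioned in the question by `offset`."""
    return question.replace(str(year), str(year + offset))


def remove_year(question: str, year: int) -> str:
    """'No time' condition: drop an 'in <year>' phrase (assumed question pattern)."""
    stripped = re.sub(rf"\s*\bin {year}\b", "", question)
    return re.sub(r"\s+", " ", stripped).strip()


def moving_average(values, window=2):
    """Trailing moving average; each output averages `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]
```

For a hypothetical question such as "Who won the election in 2015?", `corrupt_year(..., 2015, 2)` would yield the off-by-2 variant mentioning 2017, and `to_relative(2015)` would give the phrase "6 years ago".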