Unveiling Divergent Inductive Biases of LLMs on Temporal Data

Sindhu Kishore, Hangfeng He

TL;DR

This work probes how large language models encode temporal knowledge by comparing GPT-3.5 and GPT-4 across two prompt formats, Question Answering (QA) and Textual Entailment (TE), and two data types (implicit and explicit events). Using datasets from TimeBank, TempEval, AQUAINT, and TRACIE, it quantifies biases in temporal relations (BEFORE/AFTER) and truth judgments (TRUE/FALSE), and examines consistency under relation reversals. The principal findings reveal model-dependent biases: GPT-3.5 leans toward AFTER in QA and TRUE in TE, while GPT-4 leans toward BEFORE in QA and FALSE in TE; the bias profiles also differ between TE-consistent and TE-inconsistent response pairs. These results suggest that model evolution does not automatically reduce temporal biases, and they highlight the need for targeted temporal reasoning benchmarks and prompting strategies to foster robust inductive reasoning in LLMs.

Abstract

Unraveling the intricate details of events in natural language necessitates a subtle understanding of temporal dynamics. Despite the adeptness of Large Language Models (LLMs) in discerning patterns and relationships from data, their inherent comprehension of temporal dynamics remains a formidable challenge. This research meticulously explores these intrinsic challenges within LLMs, with a specific emphasis on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data. Employing two distinct prompt types, namely Question Answering (QA) format and Textual Entailment (TE) format, our analysis probes into both implicit and explicit events. The findings underscore noteworthy trends, revealing disparities in the performance of GPT-3.5 and GPT-4. Notably, biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for "AFTER" in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE". Furthermore, a consistent pattern surfaces wherein GPT-3.5 tends towards "TRUE", and GPT-4 exhibits a preference for "FALSE" in the TE format for both implicit and explicit events. This persistent discrepancy between GPT-3.5 and GPT-4 in handling temporal data highlights the intricate nature of inductive bias in LLMs, suggesting that the evolution of these models may not merely mitigate bias but may introduce new layers of complexity.
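
To make the two kinds of measurement concrete, the following is a minimal sketch of how a label bias in QA and a pairwise consistency check in TE could be tallied from model outputs. It is not the authors' evaluation code. The function names, the toy answers, and the definition of a "consistent" TE pair (exactly one of the BEFORE and AFTER hypotheses for the same event pair is judged TRUE) are assumptions made for illustration.

```python
# Illustrative sketch only: helper names and the pairing scheme are assumptions,
# not taken from the paper's implementation.

def label_share(predictions, label):
    """Fraction of predictions equal to `label` (0.0 for an empty list)."""
    if not predictions:
        return 0.0
    return sum(1 for p in predictions if p == label) / len(predictions)


def summarize_te_pairs(pairs):
    """Summarize TE answers given as (before_answer, after_answer) tuples.

    Assumption: each tuple holds the model's TRUE/FALSE answers to the
    BEFORE hypothesis and the AFTER hypothesis for the same event pair.
    A pair counts as consistent when the two answers differ.
    """
    inconsistent = [p for p in pairs if p[0] == p[1]]
    n_consistent = len(pairs) - len(inconsistent)
    both_true = sum(1 for p in inconsistent if p[0] == "TRUE")
    return {
        "consistency_rate": n_consistent / len(pairs) if pairs else 0.0,
        # Share of inconsistent pairs answered TRUE/TRUE rather than FALSE/FALSE.
        "inconsistent_true_share": (
            both_true / len(inconsistent) if inconsistent else 0.0
        ),
    }


# Toy answers invented for this example; they are not results from the paper.
toy_qa_answers = ["AFTER", "AFTER", "BEFORE", "AFTER"]
toy_te_pairs = [
    ("TRUE", "FALSE"),
    ("TRUE", "TRUE"),
    ("FALSE", "FALSE"),
    ("TRUE", "TRUE"),
]

qa_after_share = label_share(toy_qa_answers, "AFTER")
te_summary = summarize_te_pairs(toy_te_pairs)
```

Under these assumptions, a QA share well above 0.5 for one label would indicate a lean toward that label, and a high TRUE/TRUE share among inconsistent TE pairs would indicate a lean toward TRUE.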

Paper Structure

This paper contains 21 sections, 16 figures, and 3 tables.

Figures (16)

  • Figure 1: Inductive bias in OpenAI LLMs: GPT-4 exhibits a preference for BEFORE and FALSE, while GPT-3.5 tends to favor AFTER and TRUE.
  • Figure 2: Template and Examples of QA and TE prompts for implicit & explicit events.
  • Figure 3: GPT-3.5 biased towards AFTER and GPT-4 biased towards BEFORE in QA.
  • Figure 4: GPT-3.5 biased towards TRUE and GPT-4 biased towards FALSE in TE-inconsistent pairs.
  • Figure 5: Consistency in response for implicit events.
  • ...and 11 more figures