DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Gagan Bhatia, MingZe Tang, Cristina Mahanta, Madiha Kazi

TL;DR

DateLogicQA introduces a 190-question benchmark to probe LLMs' handling of dates across diverse formats and contexts. The authors define the Semantic Integrity Metric to evaluate tokenization quality and identify Representation-Level and Logical-Level biases, which affect embeddings and reasoning respectively. Through a human-led evaluation of 12 state-of-the-art LLMs, the study reveals size- and format-dependent strengths and weaknesses in temporal reasoning. The findings motivate improvements via temporally balanced pretraining, post-training fine-tuning, retrieval-augmented reasoning, and prompting strategies, highlighting practical steps to mitigate temporal biases in real-world date-centric tasks.

Abstract

This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately.

Paper Structure

This paper contains 19 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Examples of temporal biases in LLMs: incorrect response; faulty date but accurate reasoning (representation-level temporal bias); faulty reasoning but accurate date (logical-level temporal bias); correct response
  • Figure 2: Human evaluation rubric
  • Figure 3: Results Visualisations
  • Figure 4: Each bar is segmented into four colors representing the quality of responses: incorrect response; faulty date but accurate reasoning (representation-level temporal bias); faulty reasoning but accurate date (logical-level temporal bias); correct response
  • Figure 5: Correlation plot of semantic integrity score against token count