DATETIME: A new benchmark to measure LLM translation and reasoning capabilities

Edward Gaere, Florian Wangenheim

TL;DR

The paper introduces DATETIME, a publicly available benchmark for evaluating LLM translation and reasoning over datetimes, addressing a gap where datetimes present human-intuitive yet challenging tasks for machines. It organizes tasks into Translation, Computation, and Mixed categories, using synthetic data across a wide temporal range to test translation and multi-step arithmetic. Empirical results across 58 models reveal substantial performance dispersion, with frontier models showing datetime reasoning abilities while open-source models lag, especially on complex computations like Add-250. The work emphasizes reproducibility, proposes future research directions (prompting, fine-tuning, and programmatic approaches), and provides a foundation for systematic datetime evaluation in open benchmarks and future AI systems.

Abstract

This paper introduces DATETIME, a new high-quality benchmark designed to evaluate the translation and reasoning abilities of a Large Language Model (LLM) on datetimes. A datetime is simply a date and a time, for example '11th.february.2023 ,1:12:31'. Datetimes are an interesting domain because they are intuitive and straightforward for humans to process but present significant challenges for LLMs. At the time of writing, no publicly available benchmark exists for systematically evaluating LLMs on datetime processing. Our experiments show that state-of-the-art models exhibit significant difficulty with tasks involving reasoning on datetimes, and that General Artificial Intelligence is still a distant aspiration. We hypothesize that working with datetimes necessitates translation and/or computation capabilities, and the tasks of the benchmark are organized accordingly. Significant dispersion in performance across models is observed with surprisingly poor performance even on apparently trivial tasks. Whilst frontier models such as ChatGPT, Claude and Llama3.1 have evidently been built and trained with datetime reasoning abilities, significant improvement is required for the open-source models.

Paper Structure

This paper contains 69 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The Translation ISO-8601 task requires the translation of a datetime from its natural representation to its ISO-8601 representation.
  • Figure 2: The Computation Add-20 task requires adding 20 days to a datetime provided in ISO-8601 representation, and producing a new ISO-8601 datetime with the result.
  • Figure 3: Prompt, system prompt, output and evaluation for the ISO-8601 Translation task.
  • Figure 4: Prompt, system prompt, output and evaluation for the Add-20 Computation task.
  • Figure 5: Prompt, system prompt, output and evaluation for the Add-20 Mixed task.
  • ...and 2 more figures
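
To make the Translation and Computation tasks described in the figure captions concrete, the following is a minimal Python sketch of what a programmatic reference solution might look like. It is not the benchmark's own code: the helper names `to_iso8601` and `add_days` are hypothetical, and the parser assumes the input follows the single natural-language layout shown in the abstract's example ('11th.february.2023 ,1:12:31'). The benchmark's actual input formats and its exact expected output format may differ.

```python
import re
from datetime import datetime, timedelta

# Month names mapped to month numbers (assumed English, lowercase).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Assumed layout: <day><ordinal suffix>.<month name>.<year> ,<h>:<mm>:<ss>
PATTERN = re.compile(
    r"\s*(\d{1,2})(?:st|nd|rd|th)?\.([a-z]+)\.(\d{1,4})\s*,\s*"
    r"(\d{1,2}):(\d{2}):(\d{2})\s*"
)


def to_iso8601(text: str) -> str:
    """Translation task: natural representation -> ISO-8601 string."""
    match = PATTERN.fullmatch(text.lower())
    if match is None:
        raise ValueError(f"unrecognised datetime layout: {text!r}")
    day, month, year, hour, minute, second = match.groups()
    parsed = datetime(
        int(year), MONTHS[month], int(day),
        int(hour), int(minute), int(second),
    )
    return parsed.isoformat()


def add_days(iso_text: str, days: int) -> str:
    """Computation task: add a number of days to an ISO-8601 datetime."""
    return (datetime.fromisoformat(iso_text) + timedelta(days=days)).isoformat()


iso = to_iso8601("11th.february.2023 ,1:12:31")
print(iso)                 # 2023-02-11T01:12:31
print(add_days(iso, 20))   # 2023-03-03T01:12:31
print(add_days(iso, 250))  # 2023-10-19T01:12:31
```

Chaining the two helpers, as in the last lines, corresponds roughly to a Mixed task: translate first, then compute. A model answering without tools has to carry out the same month-length and carry arithmetic implicitly, which is one plausible reason longer offsets such as Add-250 are harder than Add-20.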