SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models

Zhongjian Miao, Hao Fu, Chen Wei

TL;DR

SPAN defines a dynamic, cross-calendar temporal reasoning benchmark that requires intra-calendar reasoning plus inter-calendar date conversion across six calendars and ten reasoning directions. It uses a template-driven evaluation protocol that instantiates questions from a user-specified Gregorian date, which mitigates data contamination and keeps the evaluation time-variant. Across 21 evaluation dates (1960–2060), state-of-the-art LLMs achieve only 34.5% average accuracy, revealing two significant challenges: Future-Date Degradation and Calendar Asymmetry Bias. To address this, the authors introduce Time Agent, a tool-augmented code-generation approach that leverages a cross-calendar conversion interface, achieving 95.31% average accuracy and substantially outperforming baselines. The work highlights the potential of combining LLM reasoning with external tools to advance temporally and culturally adaptive cross-calendar reasoning, and sets a direction for broader calendar coverage and task extensions.

Abstract

We introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that enables assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs' cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines, highlighting the potential of tool-augmented code generation to advance cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.

Paper Structure

This paper contains 35 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of the proposed evaluation protocol. Given a user-specified Gregorian date as input, the process proceeds through four stages: ➀ Calendar Conversion. The Gregorian date is converted into its equivalents in five calendars via our search_calendar interface, yielding $(c_s, d_{c_s}^r, f_{c_s})$ triples, with $d_{c_s}^r$ and $f_{c_s}$ denoting the reference date and the festival in the source calendar $c_s$, respectively. ➁ Template Matching. These triples are extended with a target calendar $c_t$ to form $(c_s, d_{c_s}^r, f_{c_s}, c_t)$ tuples, where $c_s$ and $c_t$ are chosen so that exactly one of them is the Gregorian calendar and the other is a non-Gregorian calendar. The tuples are matched against all question–code template pairs to generate candidate pairs. ➂ Template Instantiation. For each candidate question–code template pair, we manually specify the remaining variables $(d_{c_t}^e, n_d, n_w, n_y)$. Question–code pairs are then generated by filling the template placeholders with all variables. ➃ Code Execution. Finally, we execute each code snippet to generate the gold answer.
  • Figure 2: Left: Accuracy of LLMs across evaluation dates ranging from July 1st, 1960 to July 1st, 2060 at five-year intervals (July 1st omitted for clarity). The average accuracy over time is annotated for each model. Right: Average output token counts of LLMs at sampled evaluation dates. To ensure comparability, model outputs are tokenized using OpenAI’s tiktoken tokenizer with the o200k_base encoding.
  • Figure 3: Accuracy of date-based and festival-based cross-calendar temporal reasoning over the evaluation dates from July 1st, 1960 to July 1st, 2060 at five-year intervals (July 1st omitted for clarity). The average accuracy over time for each reasoning type is annotated.
  • Figure 4: Accuracy on content questions and polar questions over the evaluation dates from July 1st, 1960 to July 1st, 2060 at five-year intervals (July 1st omitted for clarity). The average accuracy over time for each question format is annotated.
  • Figure 5: Accuracy of Gregorian-to-Others and Others-to-Gregorian cross-calendar temporal reasoning over the evaluation dates from July 1st, 1960 to July 1st, 2060 at five-year intervals (July 1st omitted for clarity). The average accuracy over time for each group of temporal reasoning directions is annotated.
  • ...and 1 more figure
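
The four-stage protocol described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `gregorian_to_julian` and `julian_to_gregorian` are hypothetical stand-ins for the `search_calendar` interface, the Julian calendar is used only because its 13-day offset from the Gregorian calendar (valid for Gregorian dates from 1900-03-01 to 2100-02-28) fits in a few lines, and the `Template` class, the question wording, and the single variable `n_d` are all invented for the example. Festivals and the remaining variables are omitted, and code templates are modeled as Python callables rather than executed code strings.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable

# Assumption: toy stand-in for a real conversion interface. The Julian calendar
# runs 13 days behind the Gregorian one for Gregorian dates 1900-03-01..2100-02-28,
# and both share the same leap years in that window, so a Julian label can be
# carried in an ordinary date object.
JULIAN_OFFSET = timedelta(days=13)


def gregorian_to_julian(g: date) -> date:
    return g - JULIAN_OFFSET


def julian_to_gregorian(j: date) -> date:
    return j + JULIAN_OFFSET


@dataclass(frozen=True)
class Template:
    """Hypothetical question-code template pair."""
    source: str                          # source calendar c_s
    target: str                          # target calendar c_t
    question: str                        # question text with placeholders
    solve: Callable[[date, int], date]   # (reference date in c_s, n_d) -> answer in c_t


QUESTION = ("Today is {ref} in the {src} calendar. "
            "What is the {tgt} date {n_d} days from today?")

TEMPLATES = [
    Template("Gregorian", "Julian", QUESTION,
             lambda ref, n_d: gregorian_to_julian(ref + timedelta(days=n_d))),
    Template("Julian", "Gregorian", QUESTION,
             lambda ref, n_d: julian_to_gregorian(ref) + timedelta(days=n_d)),
]


def generate(gregorian_date: date, n_d: int) -> list[tuple[str, str]]:
    # Stage 1: calendar conversion of the user-specified Gregorian date.
    refs = {"Gregorian": gregorian_date,
            "Julian": gregorian_to_julian(gregorian_date)}
    items = []
    for t in TEMPLATES:
        # Stage 2: template matching. Keep a template only if its source
        # calendar is available and exactly one side is Gregorian.
        one_side_gregorian = (t.source == "Gregorian") != (t.target == "Gregorian")
        if t.source not in refs or not one_side_gregorian:
            continue
        ref = refs[t.source]
        # Stage 3: template instantiation with the remaining variable n_d.
        question = t.question.format(ref=ref.isoformat(), src=t.source,
                                     tgt=t.target, n_d=n_d)
        # Stage 4: code execution to obtain the gold answer.
        answer = t.solve(ref, n_d).isoformat()
        items.append((question, answer))
    return items


if __name__ == "__main__":
    for question, answer in generate(date(2024, 1, 7), n_d=10):
        print(question, "->", answer)
```

Because every question and gold answer is derived from the input date at generation time, changing that date yields a fresh set of instances, which is the property the protocol relies on for time-variant, contamination-resistant evaluation.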