SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models
Zhongjian Miao, Hao Fu, Chen Wei
TL;DR
SPAN is a dynamic cross-calendar temporal reasoning benchmark that requires both intra-calendar reasoning and inter-calendar date conversion, covering six calendars and ten reasoning directions. Its template-driven evaluation protocol instantiates questions from a user-specified Gregorian date, which avoids data contamination and keeps the evaluation time-variant. Across 21 evaluation dates (1960–2060), state-of-the-art LLMs achieve only $34.5\%$ average accuracy, and the analysis identifies two key obstacles: Future-Date Degradation and Calendar Asymmetry Bias. To address these, the authors introduce Time Agent, a tool-augmented code-generation approach that leverages a cross-calendar conversion interface and achieves $95.31\%$ average accuracy, outperforming several competitive baselines. The work highlights the potential of combining LLM reasoning with external tools for temporally and culturally adaptive cross-calendar reasoning, and points toward broader calendar coverage and task extensions.
Abstract
We introduce SPAN, a cross-calendar temporal reasoning benchmark that requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. For time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that supports assessment on any user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs' cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines and highlighting the potential of tool-augmented code generation to advance cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.
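As an illustration of the template-driven idea, the following is a minimal sketch of how a question could be instantiated from a user-specified Gregorian date and answered by combining an intra-calendar step (a day offset) with an inter-calendar conversion. Everything here is an assumption made for illustration: the template wording, the names `TEMPLATE`, `make_instance`, `gregorian_to_jdn`, and `jdn_to_julian` are hypothetical, and the Julian calendar is used only as a convenient stand-in target, since this section does not list which six calendars SPAN covers. It is not the authors' protocol or the Time Agent conversion interface.

```python
from datetime import date, timedelta

# Hypothetical question template; the benchmark's real templates are not shown here.
TEMPLATE = (
    "Today is {greg} in the Gregorian calendar. "
    "What is the date {offset} days from today in the Julian calendar?"
)


def gregorian_to_jdn(d: date) -> int:
    """Julian Day Number of a (proleptic) Gregorian date, integer arithmetic."""
    a = (14 - d.month) // 12
    y = d.year + 4800 - a
    m = d.month + 12 * a - 3
    return (d.day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)


def jdn_to_julian(jdn: int) -> tuple[int, int, int]:
    """(year, month, day) in the Julian calendar for a Julian Day Number."""
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10
    return year, month, day


def make_instance(anchor: date, offset_days: int) -> dict:
    """Instantiate one question/answer pair from a user-chosen anchor date."""
    # Intra-calendar step: shift the anchor within the Gregorian calendar.
    target = anchor + timedelta(days=offset_days)
    # Inter-calendar step: convert the shifted date to the target calendar.
    y, m, d = jdn_to_julian(gregorian_to_jdn(target))
    return {
        "question": TEMPLATE.format(greg=anchor.isoformat(), offset=offset_days),
        "answer": f"{y:04d}-{m:02d}-{d:02d}",
    }


if __name__ == "__main__":
    for anchor in (date(1960, 1, 1), date(2000, 1, 1), date(2060, 1, 1)):
        print(make_instance(anchor, 13))
```

Because the gold answer is computed from the anchor date at generation time, the same template yields a different instance for every evaluation date, which is what makes this style of evaluation time-variant and hard to memorize. Computing the answer with deterministic conversion code rather than recalling it is also the intuition behind a tool-augmented approach.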
