Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties

Zhenglin Wang, Jialong Wu, Pengfei LI, Yong Jiang, Deyu Zhou

TL;DR

The paper presents CTM, a culturally grounded benchmark for temporal reasoning across Chinese dynasties, addressing limitations of rule-based English benchmarks by emphasizing context, cross-entity reasoning, and temporal alignment. CTM comprises 8,750 QA pairs and 60 Timeline Ito Games derived from a repository of 4,700+ entities, spanning ten major dynastic periods and sourced from multiple authoritative references. Extensive experiments across twelve LLM backbones reveal that entity count, temporal granularity, and cross-dimensional alignment pose significant challenges, with chain-of-thought helping some models but not universally, and open-book retrieval yielding moderate gains. The work highlights directions for improved pretraining, knowledge integration, and reasoning mechanisms, and positions CTM as a valuable resource for advancing culturally grounded temporal reasoning in large language models.

Abstract

Temporal reasoning is fundamental to human cognition and is crucial for various real-world applications. While recent advances in Large Language Models have demonstrated promising capabilities in temporal reasoning, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized and culturally-grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.

Paper Structure

This paper contains 34 sections, 25 figures, 6 tables.

Figures (25)

  • Figure 1: A QA pair from a script error correction task and an instance of the Timeline Ito Game with a "fruit size" theme from CTM.
  • Figure 2: Statistics of CTM.
  • Figure 3: Average performance of Ito's Guessing Game. Detailed results can be found in Appendix \ref{app:ito_acc}.
  • Figure 4: Accuracy across entity inter-dynastic intervals under the direct prompting setting. The detailed results are shown in Figure \ref{fig:acc_span_cot}, Figure \ref{fig:line_direct}, and Figure \ref{fig:line_cot}.
  • Figure 5: Performance in the closed-book and open-book settings. Detailed results can be found in Appendix \ref{app:openbook}.
  • ...and 20 more figures