Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
Zhenglin Wang, Jialong Wu, Pengfei LI, Yong Jiang, Deyu Zhou
TL;DR
The paper presents Chinese Time Reasoning (CTM), a culturally grounded benchmark for temporal reasoning across Chinese dynasties. It addresses limitations of rule-based English benchmarks by emphasizing context, cross-entity reasoning, and temporal alignment. CTM comprises 8,750 QA pairs and 60 Timeline Ito Games derived from a repository of 4,700+ entities, spanning ten major dynastic periods and sourced from multiple authoritative references. Experiments across twelve LLM backbones show that entity count, temporal granularity, and cross-dimensional alignment pose significant challenges; chain-of-thought prompting helps some models but not all, and open-book retrieval yields moderate gains. The work highlights directions for improved pretraining, knowledge integration, and reasoning mechanisms, and positions CTM as a resource for advancing culturally grounded temporal reasoning in large language models.
Abstract
Temporal reasoning is fundamental to human cognition and is crucial for various real-world applications. While recent advances in Large Language Models (LLMs) have demonstrated promising capabilities in temporal reasoning, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized, culturally grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.
