If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs
Lars Bungum, Charles Yijia Huang, Abeer Kashar
TL;DR
This paper reframes temporal reasoning (TR) as a historical reasoning task by testing LLMs on a 1940 Norwegian trivia book, prompting models to answer as if the year were 1940, and evaluating them in English and Norwegian. It introduces a three-step methodology (dataset creation from the book, querying diverse LLM families, and LLM-as-judge evaluation) alongside a separate analysis of Norwegian-focused models, including NorwAI-Magistral24B. Findings show that model scale improves performance and that English prompts generally yield better results, though a large Norwegian model can match or exceed some international models, highlighting both the potential and the limits of TR in current LLMs. The work advances TR evaluation for non-English contexts, underscores challenges from semantic drift and data leakage, and points toward future directions such as retrieval augmentation and broader multilingual datasets.
Abstract
In this study, we examine the ability of LLMs to perform temporal reasoning. Using a Norwegian book of trivia questions from 1940, we prompt the LLMs to answer the questions as if it were 1940, posing each question in both English and Norwegian. Correct answers are often given as sentences, so grading is done by an LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than prompting in Norwegian, which was unexpected. In contrast, larger LLMs improved results, as expected. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, as well as the largest available LLM built specifically for Norwegian.
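To make the setup concrete, the following is a minimal sketch of how a temporally reframed prompt and an LLM-as-judge prompt could be assembled. The template wording, the function names, and the CORRECT/INCORRECT verdict format are illustrative assumptions, not the prompts used in this study.

```python
# Hypothetical templates: the wording below is an assumption for illustration,
# not the actual prompts used in the study.
REFRAME_TEMPLATES = {
    "en": "Answer the following question as if the year were {year}.\n"
          "Question: {question}",
    "no": "Svar på følgende spørsmål som om året var {year}.\n"
          "Spørsmål: {question}",
}

JUDGE_TEMPLATE = (
    "Reference answer: {reference}\n"
    "Model answer: {candidate}\n"
    "Does the model answer agree with the reference? "
    "Reply with CORRECT or INCORRECT."
)


def build_reframed_prompt(question: str, lang: str = "en", year: int = 1940) -> str:
    """Wrap a trivia question in an instruction to answer from the given year."""
    return REFRAME_TEMPLATES[lang].format(year=year, question=question)


def build_judge_prompt(reference: str, candidate: str) -> str:
    """Build the grading prompt that is sent to the judge model."""
    return JUDGE_TEMPLATE.format(reference=reference, candidate=candidate)


def parse_verdict(judge_reply: str) -> bool:
    """Return True only if the judge reply starts with CORRECT."""
    return judge_reply.strip().upper().startswith("CORRECT")
```

The same question can then be rendered in either language by switching `lang`, and the judge's free-text reply is reduced to a boolean that can be averaged into an accuracy score.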
