If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs

Lars Bungum, Charles Yijia Huang, Abeer Kashar

TL;DR

This paper reframes temporal reasoning (TR) as a historical reasoning task: LLMs are tested on a Norwegian trivia book from 1940, prompted to answer as if the year were 1940, and evaluated in both English and Norwegian. The methodology has three steps (creating the dataset from the book, querying several LLM families, and grading the answers with an LLM-as-judge), complemented by a separate analysis of Norwegian-focused models, including NorwAI-Magistral24B. Larger models perform better, and English prompts generally yield better results than Norwegian ones, although a large Norwegian model can match or exceed some international models, which shows both the potential and the limits of TR in current LLMs. The work extends TR evaluation to non-English contexts, identifies semantic drift and data leakage as challenges, and suggests retrieval augmentation and broader multilingual datasets as future directions.

Abstract

In this study, we examine the ability of LLMs to perform temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940. We pose the questions in both English and Norwegian. Correct answers are often presented as sentences, so grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than prompting in Norwegian, which was unexpected. Using larger LLMs, by contrast, improved results as expected. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, as well as the largest available LLM built specifically for Norwegian.
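
The querying and grading steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the prompt wording, the names TriviaItem, build_query_prompt, build_judge_prompt, parse_verdict and accuracy, and the one-word CORRECT/INCORRECT verdict format are hypothetical and are not taken from the paper. The sketch does not call any model; it only builds the prompts and scores verdicts.

```python
from dataclasses import dataclass

# Hypothetical prompt wording (assumption): the paper's actual prompts
# are not reproduced in this section.
QUERY_TEMPLATES = {
    "en": ("Answer the question as if the year were {year}. "
           "Use only knowledge available at that time.\n"
           "Question: {question}"),
    "no": ("Svar på spørsmålet som om året var {year}. "
           "Bruk bare kunnskap som var tilgjengelig da.\n"
           "Spørsmål: {question}"),
}


@dataclass(frozen=True)
class TriviaItem:
    """One question with its reference answer (hypothetical schema)."""
    question: str
    reference_answer: str
    language: str  # "en" or "no"


def build_query_prompt(item: TriviaItem, year: int = 1940) -> str:
    """Build the temporally reframed prompt sent to the model under test."""
    template = QUERY_TEMPLATES[item.language]
    return template.format(year=year, question=item.question)


def build_judge_prompt(item: TriviaItem, model_answer: str) -> str:
    """Build the grading prompt for an LLM-as-judge (assumed format)."""
    return (
        "You are grading an answer to a trivia question.\n"
        f"Question: {item.question}\n"
        f"Reference answer: {item.reference_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def parse_verdict(judge_output: str) -> bool:
    """Return True only if the judge's first word is CORRECT."""
    tokens = judge_output.strip().upper().split()
    return bool(tokens) and tokens[0].strip(".,:;!") == "CORRECT"


def accuracy(verdicts: list[bool]) -> float:
    """Fraction of answers judged correct; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

In such a setup, each question would be sent once per language and per model, the judge's output parsed with parse_verdict, and the per-model accuracy compared across languages and model sizes. A sample of the judge's verdicts would still need to be checked by a native speaker, as the abstract describes.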

Paper Structure

This paper contains 19 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Group A Model Performance