Can Language Models Handle a Non-Gregorian Calendar? The Case of the Japanese wareki

Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

TL;DR

This work investigates whether language models can handle the Japanese wareki, a non-Gregorian calendar, by constructing three targeted tasks (CalendarConversion, JapaneseCalendarArithmetic, BirthYearRecall) and evaluating English-centric, Japanese-centric, and frontier models. The study finds that Japanese-centric and frontier models perform basic calendar conversions with high accuracy but struggle with era-bound arithmetic and birth-year recall, while English-centric models fail substantially, and cross-calendar biases influence the results. Error analysis points to the corpus frequency of wareki expressions and a Gregorian bias in model knowledge as possible drivers of the performance gaps, underscoring the need for culture-specific temporal reasoning in LMs. The findings emphasize the importance of extending temporal reasoning benchmarks beyond the Gregorian calendar to improve cultural competence and cross-cultural NLP tasks in real-world settings.

Abstract

Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. Whether, and how well, current LMs handle such non-Gregorian calendars has not yet been evaluated. Here, we present a systematic evaluation of how well language models handle one such non-Gregorian system: the Japanese wareki. We create datasets that require temporal knowledge and reasoning with wareki dates. Evaluating open and closed LMs, we find that some models can perform calendar conversions, but GPT-4o, Deepseek V3, and even Japanese-centric models struggle with Japanese calendar arithmetic and knowledge involving wareki dates. Error analysis suggests corpus frequency of Japanese calendar expressions and a Gregorian bias in the models' knowledge as possible explanations. Our results show the importance of developing LMs that are better equipped for culture-specific tasks such as calendar understanding.

Paper Structure

This paper contains 24 sections, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: In the Japanese calendar (wareki), years are expressed using era names, which change irregularly according to historical events such as an emperor's accession. For example, the Reiwa era began on May 1, 2019, with the accession of Emperor Naruhito, so 2020 corresponds to Reiwa 2. In addition to showing the five eras of modern Japan (bottom) in relation to the Gregorian calendar (top), this figure illustrates three tasks designed to evaluate how LMs handle the wareki system: (1) CalendarConversion between the Gregorian calendar and wareki; (2) JapaneseCalendarArithmetic across era boundaries; (3) BirthYearRecall in both calendar systems.
  • Figure 2: Performance on CalendarConversion (Gregorian→Japanese setting). Japanese-centric LMs (red labels) and frontier LMs (purple labels) perform nearly perfectly across all eras. Some English-centric LMs (blue labels) fail even at simple conversions.
  • Figure 3: Performance on JapaneseCalendarArithmetic. A large performance gap is observed between Japanese-centric LMs (red labels) and English-centric LMs (blue labels). Even frontier LMs (purple labels) struggle with this task.
  • Figure 4: Comparison of BirthYearRecall accuracy in both Gregorian and wareki formats. The diagonal marks equal accuracy; points below it indicate a Gregorian bias. Even Japanese-centric LMs and frontier LMs exhibit a strong bias towards the Gregorian calendar, although Japanese-centric LMs perform comparatively better with the Japanese calendar than English-centric LMs.
  • Figure 5: Analysis of typical error patterns behind model failures. In JapaneseCalendarArithmetic, out-of-range errors (e.g., generating "Heisei 37") may contribute to failures in newer eras. In BirthYearRecall, Gregorian bias errors (responding in Gregorian years despite 3-shot wareki prompts) cause failures, especially in newer eras.
  • ...and 6 more figures
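
To make the conversion in the Figure 1 caption and the out-of-range error in the Figure 5 caption concrete, here is a minimal Python sketch. It is an illustration only, not the authors' evaluation code. The names `ERAS`, `gregorian_to_wareki`, and `wareki_to_gregorian` are hypothetical, and the table works at year granularity, so a transition year such as 2019 maps to two era years (Heisei 31 and Reiwa 1) rather than being split on May 1.

```python
# Illustrative sketch (not from the paper): year-level wareki conversion.
# Each entry is (era name, first Gregorian year, last Gregorian year).
# Transition years belong to both the outgoing and the incoming era.
ERAS = [
    ("Meiji", 1868, 1912),
    ("Taisho", 1912, 1926),
    ("Showa", 1926, 1989),
    ("Heisei", 1989, 2019),
    ("Reiwa", 2019, None),  # None marks the ongoing era
]


def gregorian_to_wareki(year):
    """Return every (era, era_year) pair that overlaps the Gregorian year."""
    results = []
    for name, start, end in ERAS:
        if year >= start and (end is None or year <= end):
            # Era years are 1-indexed: the first year of an era is year 1.
            results.append((name, year - start + 1))
    return results


def wareki_to_gregorian(era, era_year):
    """Return the Gregorian year, or None if the era year is out of range."""
    for name, start, end in ERAS:
        if name == era:
            year = start + era_year - 1
            if era_year < 1 or (end is not None and year > end):
                return None  # e.g. "Heisei 37" never existed
            return year
    raise ValueError(f"unknown era: {era}")


if __name__ == "__main__":
    print(gregorian_to_wareki(2020))          # the caption's Reiwa 2 example
    print(gregorian_to_wareki(2019))          # transition year: two answers
    print(wareki_to_gregorian("Heisei", 37))  # out-of-range era year
```

Returning a list for transition years keeps the ambiguity visible instead of silently picking one era; a day-level converter would need the exact era start dates, which this sketch deliberately leaves out.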