Shared Heritage, Distinct Writing: Rethinking Resource Selection for East Asian Historical Documents

Seyoung Song, Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh

TL;DR

This work questions the prevalent assumption that Classical Chinese resources universally boost NLP for neighboring East Asian historical languages. Through machine translation (MT), named entity recognition (NER), and punctuation restoration (PR) experiments on Hanja and Kanbun, the authors show that adding Classical Chinese data yields minimal gains, except in extremely low-resource scenarios; the improvements largely vanish as local Hanja data grows. They introduce the Korean Literary Collections (KLC) to diversify Hanja data and analyze domain- and architecture-dependent effects, finding that cross-lingual transfer is highly sensitive to domain, resource balance, and linguistic differences beyond the shared script. Kanbun can benefit from cross-lingual signals in low-resource settings, but vocabulary divergence and deeper linguistic disparities limit transfer elsewhere. The study emphasizes empirical validation, presents a threshold beyond which the benefits diminish, and releases code and data to guide future work in historical Sinosphere NLP.

Abstract

Historical documents in the Sinosphere are known to share common formats and practices, particularly in veritable records compiled by court historians. This shared linguistic heritage has led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan, which remain relatively low-resource. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within $\pm{}0.0068$ F1-score for sequence labeling tasks and up to $+0.84$ BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These findings emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.

Paper Structure

This paper contains 46 sections, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Language transfer from Classical Chinese to neighboring countries in the Sinosphere. Classical Chinese spread to neighboring countries in East Asia and was used from the 6th century BC to the 20th century AD. While the modern languages (gray) differ from each other, the ancient languages (black) are mutually intelligible.
  • Figure 2: Comparison of models trained with and without Classical Chinese (Lzh). Results show BLEU scores (MT) and F1-scores (NER, PR) across three document types: Hanja royal records (Hj$^{\text{R}}$), Hanja literary works (Hj$^{\text{L}}$), and Classical Chinese (Lzh), with error bars of 95% confidence intervals for MT and standard deviations for NER and PR. Statistical significance is denoted as: *** ($p < 0.001$), ** ($p < 0.01$), * ($p < 0.05$), and n.s. (not significant).
  • Figure 3: Performance impact of Classical Chinese training data across varying Hanja data ratios. The $x$-axis shows the ratio $r$, where Hj:Lzh = $r$:1 denotes the proportion of Hanja data against Classical Chinese data, while the $y$-axis shows the relative performance differences in percentage (%) between models trained with/without Classical Chinese data. Square and x markers indicate statistically significant differences ($p < 0.05$) and non-significant differences, respectively.
  • Figure 4: Distribution of unique characters across writing systems in the Sinosphere. The bars represent the proportion of shared characters with Classical Chinese versus language-specific variants in each writing system.
  • Figure 5: Heatmap of character coverage gaps between Sinosphere languages. Each cell shows the percentage of characters in the row language that are not among the most common characters of the column language at a 99% frequency threshold.
  • ...and 4 more figures
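
The quantities described in the captions of Figures 3 and 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the reading of the "99% frequency threshold" (the smallest set of most frequent characters whose cumulative share of the corpus reaches 99%, applied to both the row and the column language) is an assumption made here for illustration.

```python
from collections import Counter


def core_characters(text: str, threshold: float = 0.99) -> set[str]:
    """Smallest set of most frequent characters whose cumulative
    share of the corpus reaches `threshold` (assumed definition)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    core: set[str] = set()
    covered = 0
    for ch, n in counts.most_common():
        # Stop once the characters collected so far cover enough of the corpus.
        if total == 0 or covered / total >= threshold:
            break
        core.add(ch)
        covered += n
    return core


def coverage_gap(row_text: str, col_text: str, threshold: float = 0.99) -> float:
    """Percentage of the row corpus's core characters that are missing
    from the column corpus's core set (one possible reading of Figure 5)."""
    row_core = core_characters(row_text, threshold)
    col_core = core_characters(col_text, threshold)
    if not row_core:
        return 0.0
    return 100.0 * len(row_core - col_core) / len(row_core)


def relative_difference(with_lzh: float, without_lzh: float) -> float:
    """Relative performance change in percent between a model trained with
    Classical Chinese data and one trained without it (Figure 3's y-axis)."""
    return 100.0 * (with_lzh - without_lzh) / without_lzh


if __name__ == "__main__":
    # Toy strings only; these are not samples from the paper's corpora.
    hanja_sample = "王曰王命臣曰"
    lzh_sample = "子曰學而時習之"
    print(coverage_gap(hanja_sample, lzh_sample))
    # Made-up scores, used only to show the call.
    print(relative_difference(with_lzh=110.0, without_lzh=100.0))
```

Note that the gap is asymmetric: swapping the row and column corpora generally gives a different value, which is why the figure is a full heatmap rather than a triangular one.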