Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

Zilong Li, Jie Cao

TL;DR

This work treats the Kundoku annotation system for translating Classical Chinese into Japanese as a sequence tagging task, grounding Kaeriten and Okurigana in a stack-based reading model validated by a pushdown automaton. It introduces a new large-scale dataset derived from online bilingual sources and demonstrates that multitask learning with auxiliary Chinese NLP tasks improves performance on the main Kundoku tagging tasks in low-resource settings. The study also benchmarks multiple LLMs, finding strong translation performance but weaker Kaeriten annotation, and shows that the proposed tagging-and-transduction method offers a practical alternative that supports annotation. Overall, the approach provides a principled, resource-efficient path to automating Kundoku-style translation while highlighting the complementary role of LLMs.

Abstract

Historically, Classical Chinese was translated into Japanese by placing annotations around each character. We abstract this process as sequence tagging tasks and fit them into modern language technologies. Research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that, in the low-resource setting, introducing auxiliary Chinese NLP tasks benefits the training of the sequence tagging tasks. We also evaluate large language models: they achieve high scores in direct machine translation but struggle when asked to annotate characters. Our method could serve as a supplement to LLMs.

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: A Classical Chinese sentence with marks and its Japanese translation. Green punctuation marks are Kutōten, which segment sentences. Blue symbols are Kaeriten, which indicate the reading order. Red characters are Okurigana, which mark grammatical and inflectional roles.
  • Figure 2: Examples of Kaeriten. Sentences on the left show characters with marks. Sentences on the right show the same characters in the correct reading order. Black arrows point to characters that are read after the target character. Stack operations are listed under each example.
  • Figure 3: An example of our multitask learning model's structure. Solid lines represent the flow of embeddings and logits. Dashed lines represent the flow of loss.
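
The stack operations mentioned in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `kundoku_order`, the `(character, mark)` token format, and the restriction to the レ, 一, and 二 marks are all assumptions made for illustration, and compound or higher-order marks are deliberately left out. A character marked レ or 二 is pushed onto a stack and read later; a character with no deferring mark is read immediately, after which any resolvable deferred characters are popped.

```python
from typing import List, Optional, Tuple

# Hypothetical token format: (character, Kaeriten mark or None).
Token = Tuple[str, Optional[str]]

SUPPORTED_MARKS = (None, "レ", "一", "二")


def kundoku_order(tokens: List[Token]) -> str:
    """Reorder characters according to a simplified subset of Kaeriten."""
    stack: List[Token] = []   # characters whose reading is deferred
    output: List[str] = []    # characters in reading order

    for char, mark in tokens:
        if mark not in SUPPORTED_MARKS:
            raise ValueError(f"unsupported mark in this sketch: {mark!r}")

        # レ and 二 defer the character: push it and move on.
        if mark in ("レ", "二"):
            stack.append((char, mark))
            continue

        # Unmarked or 一-marked characters are read right away.
        output.append(char)
        ichi_pending = mark == "一"

        # Pop deferred characters that can now be read.
        while stack:
            top_char, top_mark = stack[-1]
            if top_mark == "レ":
                pass  # レ resolves as soon as the following unit is read
            elif top_mark == "二" and ichi_pending:
                ichi_pending = False  # one 一 releases one 二
            else:
                break  # a 二 still waiting for its 一 blocks the stack
            stack.pop()
            output.append(top_char)

    if stack:
        raise ValueError("unresolved Kaeriten marks remain on the stack")
    return "".join(output)


# A commonly cited marking of 有朋自遠方来 (Analects).
example = [("有", "レ"), ("朋", None), ("自", "二"),
           ("遠", None), ("方", "一"), ("来", None)]
print(kundoku_order(example))  # 朋有遠方自来
```

In this sketch the stack only ever holds characters that are waiting on a later one, which is why the reading order can be checked with a pushdown-style procedure; Okurigana insertion is outside its scope.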