Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese
Zilong Li, Jie Cao
TL;DR
This work treats the Kundoku annotation system for translating Classical Chinese into Japanese as a sequence tagging task, grounding Kaeriten and Okurigana in a stack-based reading model validated by a pushdown automaton. It introduces a new large-scale dataset derived from online bilingual sources and demonstrates that multitask learning with auxiliary Chinese NLP tasks improves performance on the main Kundoku tagging task in low-resource settings. The study also benchmarks multiple LLMs, finding strong translation performance but weaker Kaeriten annotation, and shows that the proposed tagging-and-transduction method offers a practical, annotation-supporting alternative. Overall, the approach provides a principled, resource-efficient path to automating Kundoku-style translation while highlighting the complementary role of LLMs.
Abstract
Ancient people translated Classical Chinese into Japanese by adding annotations around each character. We abstract this process as a set of sequence tagging tasks and fit them into modern language technologies. Research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that, in the low-resource setting, introducing auxiliary Chinese NLP tasks improves the training of the sequence tagging tasks. We also evaluate the performance of large language models. They achieve high scores in direct machine translation but struggle when asked to annotate characters. Our method can serve as a complement to LLMs.
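To make the annotate-then-reorder idea concrete, the sketch below shows how per-character tags can drive a stack-based reading order. It is a deliberately simplified illustration, not the authors' model: it handles only the re-ten mark (レ), which defers a character until after the following one, and ignores other Kaeriten marks and Okurigana. The function name `reading_order` and the `(character, mark)` input format are assumptions made for this example.

```python
from typing import List, Tuple

RE_TEN = "レ"  # assumed tag: read this character after the next one
NO_MARK = ""   # assumed tag: character carries no Kaeriten mark


def reading_order(tagged: List[Tuple[str, str]]) -> List[str]:
    """Reorder characters into Japanese reading order using a stack.

    Simplified sketch: only the re-ten mark is supported.
    """
    stack: List[str] = []   # characters whose reading is deferred
    output: List[str] = []  # characters in the order they are read
    for char, mark in tagged:
        if mark == RE_TEN:
            # Defer this character until the next unmarked one is read.
            stack.append(char)
        else:
            # Read the unmarked character, then release deferred ones
            # in last-in, first-out order.
            output.append(char)
            while stack:
                output.append(stack.pop())
    if stack:
        raise ValueError("re-ten mark on the final character has no target")
    return output


# 不(レ) 読(レ) 書 is read as 書, 読, 不.
print(reading_order([("不", RE_TEN), ("読", RE_TEN), ("書", NO_MARK)]))
```

In this framing, a tagger only needs to predict one mark per character, and the reordering step is deterministic once the marks are known.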
