Syntactic Transfer to Kyrgyz Using the Treebank Translation Method
Anton Alekseev, Alina Tillabaeva, Gulnara Dzh. Kabaeva, Sergey I. Nikolenko
TL;DR
This paper addresses the challenge of building high-quality Kyrgyz syntactic corpora by proposing a semi-automatic, cross-lingual transfer method that projects syntactic annotations from Turkish to Kyrgyz using a treebank translation approach. The pipeline combines Turkish dependency parses, machine translation (including GPT-4o with task-focused prompts), and word-alignment-based annotation projection, with lemmatization via apertium-kir. Evaluation on the TueCL UD Kyrgyz treebank shows that this approach yields higher syntactic annotation accuracy than a monolingual model trained on the Kyrgyz KTMU treebank. The paper also introduces a method for gauging the complexity of manually annotating the resulting trees. The work provides a reusable Python package and demonstrates a practical route to rapidly expanding Kyrgyz syntactic resources, with broader implications for other low-resource languages.
Abstract
Kyrgyz is a low-resource language, and creating high-quality syntactic corpora for it requires significant effort. This study proposes an approach that simplifies the development of a syntactic corpus for Kyrgyz. We present a tool that transfers syntactic annotations from Turkish to Kyrgyz using a treebank translation method. The tool was evaluated on the TueCL treebank. The results show that this approach achieves higher syntactic annotation accuracy than a monolingual model trained on the Kyrgyz KTMU treebank. Additionally, the study introduces a method for assessing the complexity of manually annotating the resulting syntactic trees, which can help further optimize the annotation process.
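To make the idea of alignment-based annotation projection concrete, here is a minimal sketch of how heads and dependency labels might be copied from a parsed Turkish sentence onto its Kyrgyz translation. It is not the authors' package: the `Token` class, the `project_tree` function, the one-to-one alignment dictionary, and the toy sentence pair with its parse are all illustrative assumptions. Target tokens whose head cannot be resolved are left unannotated, which is one plausible way to mark them for manual correction.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str
    head: Optional[int]    # 1-based index of the head, 0 for root, None if unknown
    deprel: Optional[str]  # dependency relation label, None if unknown


def project_tree(
    source: List[Token],
    target_forms: List[str],
    alignment: Dict[int, int],
) -> List[Token]:
    """Project heads and labels from source tokens to target tokens.

    `alignment` maps 1-based source positions to 1-based target positions
    (hypothetical one-to-one alignment; real aligners may be many-to-many).
    """
    target = [Token(form, None, None) for form in target_forms]
    for src_idx, tgt_idx in alignment.items():
        src_tok = source[src_idx - 1]
        if src_tok.head == 0:
            tgt_head: Optional[int] = 0  # the root stays the root
        else:
            # None when the source head is unknown or has no aligned target token
            tgt_head = alignment.get(src_tok.head)
        if tgt_head is None:
            continue  # leave this token for manual annotation
        target[tgt_idx - 1].head = tgt_head
        target[tgt_idx - 1].deprel = src_tok.deprel
    return target


# Toy example (assumed parse): Turkish "Ben kitap okudum" -> Kyrgyz "Мен китеп окудум"
turkish = [
    Token("Ben", 3, "nsubj"),
    Token("kitap", 3, "obj"),
    Token("okudum", 0, "root"),
]
kyrgyz_forms = ["Мен", "китеп", "окудум"]

projected = project_tree(turkish, kyrgyz_forms, {1: 1, 2: 2, 3: 3})
partial = project_tree(turkish, kyrgyz_forms, {1: 1, 3: 3})  # "китеп" unaligned

for tok in projected:
    print(tok.form, tok.head, tok.deprel)
```

In the `partial` case the unaligned token keeps `head=None`, so the number of such gaps per sentence could serve as a rough signal of how much manual work a projected tree still needs. This is only an illustration and not the complexity measure introduced in the paper.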
