Syntactic Transfer to Kyrgyz Using the Treebank Translation Method

Anton Alekseev, Alina Tillabaeva, Gulnara Dzh. Kabaeva, Sergey I. Nikolenko

TL;DR

This paper addresses the challenge of building high-quality Kyrgyz syntactic corpora by proposing a semi-automatic, cross-lingual transfer method that projects syntactic annotations from Turkish to Kyrgyz using a treebank-translation approach. It implements a pipeline combining Turkish dependency parses, machine translation (including GPT-4o with task-focused prompts), and word-alignment-based annotation projection, with lemmatization via apertium-kir. Evaluations on the TueCL UD Kyrgyz treebank show that this approach yields higher syntactic annotation accuracy than a monolingual model trained on KTMU, and it introduces a method to gauge manual annotation complexity. The work provides a reusable Python package and demonstrates a practical route to rapidly expanding Kyrgyz syntactic resources, with broader implications for other low-resource languages.

Abstract

The Kyrgyz language, as a low-resource language, requires significant effort to create high-quality syntactic corpora. This study proposes an approach to simplify the development process of a syntactic corpus for Kyrgyz. We present a tool for transferring syntactic annotations from Turkish to Kyrgyz based on a treebank translation method. The effectiveness of the proposed tool was evaluated using the TueCL treebank. The results demonstrate that this approach achieves higher syntactic annotation accuracy compared to a monolingual model trained on the Kyrgyz KTMU treebank. Additionally, the study introduces a method for assessing the complexity of manual annotation for the resulting syntactic trees, contributing to further optimization of the annotation process.
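The annotation-transfer step described above can be illustrated with a minimal sketch. It is not the authors' package: the `Token` class, the `project_annotations` function, and the one-to-one alignment dictionary are hypothetical names introduced here, and the Turkish/Kyrgyz sentence pair is an illustrative example rather than one taken from the paper. The sketch assumes that a word alignment between the parsed Turkish sentence and its Kyrgyz translation is already available, and it copies each head and dependency relation across that alignment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str
    head: Optional[int]    # 1-based index of the head, 0 for root, None if unknown
    deprel: Optional[str]  # dependency relation label, None if unknown


def project_annotations(
    source: List[Token],
    target_forms: List[str],
    alignment: Dict[int, int],
) -> List[Token]:
    """Project heads and deprels from a parsed source sentence to a target sentence.

    `alignment` maps 1-based source token indices to 1-based target token
    indices (assumed one-to-one here, which is a simplification).
    """
    # Target tokens start unannotated; unaligned ones stay that way.
    projected = [Token(form, None, None) for form in target_forms]
    for src_idx, tgt_idx in alignment.items():
        src = source[src_idx - 1]
        if src.head == 0:
            head = 0  # the root stays the root
        else:
            # None when the source head has no aligned target word
            head = alignment.get(src.head)
        projected[tgt_idx - 1].head = head
        projected[tgt_idx - 1].deprel = src.deprel
    return projected


# Illustrative pair (not from the paper): "I read a book".
turkish = [
    Token("Ben", 3, "nsubj"),
    Token("kitap", 3, "obj"),
    Token("okudum", 0, "root"),
]
kyrgyz_forms = ["Мен", "китеп", "окудум"]
projected = project_annotations(turkish, kyrgyz_forms, {1: 1, 2: 2, 3: 3})
for tok in projected:
    print(tok.form, tok.head, tok.deprel)
```

Target words left with `head=None` are the ones an annotator would need to attach by hand, which is one simple way to think about the manual correction effort the abstract mentions.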

Paper Structure

This paper contains 22 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: An example of aligned Russian and English sentences.
  • Figure 2: Translations of the sentence "Стояли звери около двери" (en. "The beasts were standing by the door") into Turkish and Kyrgyz demonstrate a very similar word order. Above the Turkish sentence, its Universal Dependencies parse is shown, obtained using UDPipe 2 (model turkish-imst-ud-2.12-230717, trained on the UD-IMST treebank data; Sulubacak et al., 2016; Sulubacak and Eryiğit, 2018).
  • Figure 3: An example from the TueCL treebank (at the top) and its Turkish translation (bottom), annotated using the Stanza-IMST-charlm model. Words without a corresponding counterpart in the Turkish sentence are highlighted in red, while dependency relations (deprel) predicted incorrectly are marked in dark red.