No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Dmitry Karpov

TL;DR

This study addresses low-resource machine translation into five Turkic languages by comparing LoRA fine-tuning on synthetically augmented data with prompting strategies, both zero-shot and retrieval-augmented. Multi-language LoRA fine-tuning on augmented data gives the strongest results for Bashkir and Kazakh, while retrieval-augmented prompting substantially improves Chuvash translation, where data is extremely scarce. Tatar and Kyrgyz show mixed outcomes, with zero-shot prompting sometimes surpassing fine-tuning. The work also releases the YaTURK-7lang dataset and model weights to support further research and practical MT applications for these languages. No single approach dominates across all languages, which underscores the need for language-adaptive strategies in low-resource MT.

Abstract

We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
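
As a rough illustration of "prompting with retrieved similar examples", the sketch below picks the most similar source sentences from a small parallel memory and places them in a few-shot prompt. The abstract does not say how similarity was computed or how the prompt was worded, so the character-trigram overlap, the prompt template, and every function name here are assumptions made for illustration. The target sides are placeholders, not real Chuvash text.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Multiset of character n-grams, with padding spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def similarity(a, b):
    """Dice-style overlap of character n-gram multisets, in [0, 1]."""
    ca, cb = char_ngrams(a), char_ngrams(b)
    overlap = sum((ca & cb).values())
    total = sum(ca.values()) + sum(cb.values())
    return 2 * overlap / total if total else 0.0


def retrieve(source, memory, k=2):
    """Return the k (source, target) pairs whose source side is closest."""
    ranked = sorted(memory, key=lambda pair: similarity(source, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(source, memory, k=2, src_lang="English", tgt_lang="Chuvash"):
    """Assemble a few-shot prompt from retrieved pairs (hypothetical template)."""
    blocks = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in retrieve(source, memory, k):
        blocks.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    blocks.append(f"{src_lang}: {source}\n{tgt_lang}:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Placeholder targets stand in for real reference translations.
    memory = [
        ("The weather is nice today", "<target translation 1>"),
        ("I like reading books", "<target translation 2>"),
        ("The weather is cold today", "<target translation 3>"),
    ]
    print(build_prompt("The weather is warm today", memory))
```

The resulting string would then be sent to the language model. A real system would likely use a stronger retriever (for example, sentence embeddings) over a much larger memory.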

Paper Structure (8 sections, 2 tables)