Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair
Maksim Borisov, Zhanibek Kozhirbayev, Valentin Malykh
TL;DR
This work tackles machine translation for the low-resource, code-switched Kazakh-Russian language pair by generating synthetic code-switching data and introducing the KRCS evaluation corpus. The authors show that synthetic augmentation, especially cs-5 with SimAlign-based alignment, can significantly improve translation quality when fine-tuning large multilingual models, reaching a BLEU score of 16.48, close to that of a commercial system. They also perform domain adaptation by translating a Russian tweet corpus into Kazakh, underscoring the importance of domain-aligned data for code-switching translation. Overall, the study contributes a code-switching dataset, a practical augmentation method, and evidence that near-commercial performance is achievable in a challenging low-resource code-switching setting, with implications for extending such methods to other language pairs.
Abstract
Machine translation for low-resource language pairs is a challenging task, and it becomes even more difficult when a speaker uses code-switching. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switching Kazakh-Russian parallel corpus and evaluation results, including a model that achieves 16.48 BLEU, almost matching an existing commercial system and outperforming it in human evaluation.
