Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair

Maksim Borisov, Zhanibek Kozhirbayev, Valentin Malykh

TL;DR

This work tackles machine translation for a low-resource, code-switched language pair, Kazakh-Russian, by generating synthetic code-switching data and introducing the KRCS evaluation corpus. The authors demonstrate that synthetic augmentation, especially cs-5 using SimAlign-based alignment, can significantly improve translation quality when fine-tuning large multilingual models, reaching a BLEU score of around 16.48, which is competitive with a commercial system. They also present domain adaptation via translation of a Russian tweet corpus into Kazakh, underscoring the importance of domain-aligned data for code-switching translation. Overall, the study introduces a valuable code-switching dataset, a practical augmentation methodology, and evidence that near-commercial performance is achievable in a challenging low-resource code-switching (CSW) setting, with implications for extending such methods to other language pairs.

Abstract

Machine translation for low-resource language pairs is a challenging task, and it becomes considerably harder when a speaker uses code-switching. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switching Kazakh-Russian parallel corpus and the evaluation results, which include a model achieving 16.48 BLEU, almost reaching an existing commercial system and beating it in human evaluation.
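
As a rough illustration of alignment-based synthetic code-switching, the sketch below replaces a fraction of Kazakh tokens with the Russian tokens they are aligned to. It is a minimal sketch under stated assumptions, not the authors' exact procedure: the function name `make_code_switched`, the `ratio` and `seed` parameters, and the hand-written alignment pairs are all hypothetical. In the setting described above, the alignment would come from a word aligner such as SimAlign rather than being written by hand.

```python
import random


def make_code_switched(src_tokens, tgt_tokens, alignment, ratio=0.3, seed=0):
    """Swap a fraction of aligned source tokens for their target-language tokens.

    alignment is a list of (source_index, target_index) pairs.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Keep the first aligned target index for every source index.
    src_to_tgt = {}
    for s, t in alignment:
        src_to_tgt.setdefault(s, t)

    candidates = sorted(src_to_tgt)
    if not candidates:
        return list(src_tokens)

    # Number of tokens to switch, capped by how many tokens are aligned.
    k = min(len(candidates), max(1, round(ratio * len(src_tokens))))
    chosen = set(rng.sample(candidates, k))

    return [
        tgt_tokens[src_to_tgt[i]] if i in chosen else tok
        for i, tok in enumerate(src_tokens)
    ]


# Toy example: "I am reading a book" in Kazakh and Russian.
src = ["мен", "кітап", "оқып", "жатырмын"]
tgt = ["я", "читаю", "книгу"]
# Hand-written alignment (assumption); an aligner would normally produce this.
alignment = [(0, 0), (1, 2), (2, 1), (3, 1)]

mixed = make_code_switched(src, tgt, alignment, ratio=0.5, seed=0)
print(" ".join(mixed))
```

The inserted Russian words keep their inflected forms, so the output is only an approximation of natural code-switching; the mixed sentence would then be paired with the original Russian reference to form a synthetic training example.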

Paper Structure

This paper contains 25 sections, 1 figure, 10 tables.

Figures (1)

  • Figure 1: Sentence embedding visualization with dataset centroids.