KazParC: Kazakh Parallel Corpus for Machine Translation
Rustem Yeshpanov, Alina Polonskaya, Huseyin Atakan Varol
TL;DR
KazParC is the first and largest openly available parallel corpus for machine translation (MT) across Kazakh, English, Russian, and Turkish, comprising 371,902 human-translated sentences from diverse domains. The authors also develop Tilmash, a neural MT model fine-tuned on KazParC and augmented with a large synthetic corpus, SynC, and report performance competitive with or superior to Google Translate and Yandex Translate on BLEU and chrF. The work describes a full data pipeline, covering data sourcing, human translation, preprocessing, domain labeling, and integration of synthetic data, and validates results on the FLoRes benchmark. Both resources are released under CC BY 4.0 for reuse in MT research and practical Kazakh translation, and the authors plan to cover more domains and refine model performance. Overall, KazParC and Tilmash expand the MT resources available for Kazakh and illustrate the value of combining high-quality human translations with synthetic data for low-resource language translation.
Abstract
We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish. The first and largest publicly available corpus of its kind, KazParC contains 371,902 parallel sentences covering different domains and was developed with the assistance of human translators. Our research efforts also extend to the development of a neural machine translation model nicknamed Tilmash. Remarkably, the performance of Tilmash is on par with, and in certain instances surpasses, that of industry giants such as Google Translate and Yandex Translate, as measured by the standard evaluation metrics BLEU and chrF. Both KazParC and Tilmash are openly available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.
