Neural machine translation system for Lezgian, Russian and Azerbaijani languages

Alidar Asvarov, Andrey Grabovoy

TL;DR

This work introduces the first neural MT system for translating between Lezgian, Russian, and Azerbaijani, supported by newly collected parallel and monolingual Lezgian corpora and a LaBSE-based sentence encoder tailored to Lezgian. Through a series of data- and language-pair ablations, the authors show that adding more language pairs does not consistently improve quality. The final system reaches BLEU scores ranging from 22.89 (Azerbaijani-Lezgian) to 29.48 (Lezgian-Russian) across the four directions. A zero-shot evaluation of a large language model shows fluent Lezgian output, although the model often refuses to translate, citing its own lack of competence. The authors release the parallel and monolingual data, the final NMT model, and the sentence encoder, providing a foundation for further MT work on Lezgian and other low-resource languages. The work highlights both the potential and the limitations of multilingual models and LLMs for endangered languages, underscoring the need for larger, more diverse corpora and domain-aware evaluation.

Abstract

We release the first neural machine translation system for translation between Russian, Azerbaijani and the endangered Lezgian languages, as well as monolingual and parallel datasets collected and aligned for training and evaluating the system. Multiple experiments are conducted to identify how different sets of training language pairs and data domains can influence the resulting translation quality. We achieve BLEU scores of 26.14 for Lezgian-Azerbaijani, 22.89 for Azerbaijani-Lezgian, 29.48 for Lezgian-Russian and 24.25 for Russian-Lezgian pairs. The quality of zero-shot translation is assessed on a Large Language Model, showing its high level of fluency in Lezgian. However, the model often refuses to translate, justifying itself with its incompetence. We contribute our translation model along with the collected parallel and monolingual corpora and sentence encoder for the Lezgian language.

Paper Structure (15 sections, 11 tables)