Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data

Jean Seo, Sungjoo Byun, Minha Kang, Sangah Lee

TL;DR

This work addresses the endangerment of the Manchu language by introducing Mergen, the first Manchu-Korean MT model, designed to work under extreme data scarcity. It combines a seq2seq encoder-decoder built around a bi-directional GRU with a GloVe-guided data augmentation pipeline that expands both the training data and the vocabulary, using Mǎnwén Lǎodàng and a Manchu-Korean dictionary as parallel resources. The approach yields substantial BLEU gains (up to approximately 38 on the primary test and around 28 on the combined test) over near-zero baselines, demonstrating a viable path for MT in ultra-low-resource settings. Overall, the study provides a practical, data-driven strategy for preserving Manchu and offers a template for applying augmentation and neural MT to other endangered languages.

Abstract

The Manchu language, with its roots in the historical Manchurian region of Northeast China, is now facing a critical threat of extinction, as there are very few speakers left. In our efforts to safeguard the Manchu language, we introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation (MT) model. To develop this model, we utilize valuable resources such as the Manwen Laodang (a historical book) and a Manchu-Korean dictionary. Due to the scarcity of Manchu-Korean parallel data, we expand our data by employing word replacement guided by GloVe embeddings, trained on both monolingual and parallel texts. Our approach is built around an encoder-decoder neural machine translation model, incorporating a bi-directional Gated Recurrent Unit (GRU) layer. The experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation, with a remarkable 20-30 point increase in the BLEU score.

Paper Structure (14 sections, 2 figures, 3 tables)

This paper contains 14 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Our data augmentation methodology. First, we train ten versions of GloVe embedding models, varying the minimum token length of the source data and the window size. Then, the presumed synonym for the target word is selected by comparing how frequently each candidate appears in the outputs of the models. Finally, we augment the data by replacing original words with their synonyms where possible. Each pair of original and substituted words is shown in the same color.
  • Figure 2: Example romanizations of Manchu text and Korean text.
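
The synonym selection described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: each embedding model is represented here by a plain dictionary mapping a word to its nearest neighbours (a hypothetical stand-in for querying a trained GloVe model), the names vote_synonym and augment and the min_votes threshold are assumptions made for illustration, and the toy English tokens are placeholders rather than real Manchu or Korean data.

```python
from collections import Counter


def vote_synonym(word, neighbour_tables, min_votes=2):
    """Pick the candidate proposed by the most models, or None if agreement is weak.

    neighbour_tables: one dict per embedding model (hypothetical stand-in),
    mapping a word to the list of its nearest neighbours in that model.
    """
    votes = Counter()
    for table in neighbour_tables:
        # Count each candidate at most once per model.
        for candidate in set(table.get(word, [])):
            if candidate != word:
                votes[candidate] += 1
    if not votes:
        return None
    # Highest vote count wins; ties are broken alphabetically for determinism.
    best, count = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return best if count >= min_votes else None


def augment(tokens, neighbour_tables, min_votes=2):
    """Return one new token list per position that has an agreed-upon synonym."""
    variants = []
    for i, tok in enumerate(tokens):
        synonym = vote_synonym(tok, neighbour_tables, min_votes)
        if synonym is not None:
            variants.append(tokens[:i] + [synonym] + tokens[i + 1:])
    return variants


# Toy neighbour tables standing in for three differently configured models.
tables = [
    {"big": ["large", "huge"]},
    {"big": ["large"]},
    {"big": ["huge", "large"], "house": ["home"]},
]

print(vote_synonym("big", tables))        # candidate backed by the most models
print(augment(["big", "house"], tables))  # sentences with one word replaced
```

Requiring agreement across several models is one plausible way to filter out neighbours that are merely related rather than synonymous. The threshold the authors actually use is not stated in this section.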