High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering
Hengjie Liu, Ruibo Hou, Yves Lepage
TL;DR
This work tackles data scarcity in low-resource NMT by leveraging a source-language monolingual corpus through a GAN-based augmentation framework that avoids degrading the translation model. It integrates Translation Memory into the NMT input to expand the training data, and employs a GAN with a Transformer generator and a CNN–BiLSTM discriminator so that source-side monolingual data can be used without low-quality synthetic translations interfering with the generator. A novel high-quality filtering procedure uses length and perplexity ratios, grounded in Gaussian-like statistics from natural parallel data, and may include similar-domain retrieval to select suitable sentences. Empirical results on German–Upper Sorbian show that combining TM with the GAN yields the largest gains (≈+2.9 BLEU over the baseline), illustrating the practical potential of source-side augmentation for rare-language translation tasks. The approach offers a pathway to better data efficiency in NMT, though GAN instability and the computational cost of TM remain challenges for broader deployment.
Abstract
Back-translation is a widely used technique for extending datasets in low-resource translation tasks. It typically translates from the target language into the source language to ensure high-quality translation results. This paper proposes a novel way of using a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings. We realize this idea with a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic translations of the monolingual data with the generator. Additionally, this paper integrates Translation Memory (TM) with NMT, increasing the amount of data available to the generator. Finally, we propose a novel procedure to filter the synthetic sentence pairs during the augmentation process, ensuring that the added data is of high quality.
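As an illustration of ratio-based filtering, the following minimal sketch fits a Gaussian to the length ratios of natural parallel pairs and keeps only the synthetic pairs whose ratio lies close to the mean. It is not the authors' implementation: the function names, the whitespace tokenization, the target-over-source direction of the ratio, and the threshold `k` are all assumptions made for this example. The same test could in principle be applied to perplexity ratios computed by a language model, which is not shown here.

```python
from statistics import mean, stdev


def length_ratio(src: str, tgt: str) -> float:
    """Target/source length ratio in whitespace tokens (assumed definition)."""
    return len(tgt.split()) / max(len(src.split()), 1)


def fit_gaussian(ratios):
    """Estimate mean and sample standard deviation of the natural ratios."""
    return mean(ratios), stdev(ratios)


def keep_pair(ratio: float, mu: float, sigma: float, k: float = 2.0) -> bool:
    """Accept a ratio within k standard deviations of the mean (k is an assumption)."""
    return abs(ratio - mu) <= k * sigma


def filter_pairs(natural, synthetic, k: float = 2.0):
    """Keep synthetic (src, tgt) pairs whose length ratio resembles natural data."""
    mu, sigma = fit_gaussian([length_ratio(s, t) for s, t in natural])
    return [
        (s, t)
        for s, t in synthetic
        if keep_pair(length_ratio(s, t), mu, sigma, k)
    ]
```

A larger `k` keeps more synthetic pairs at the cost of admitting more outliers; a suitable value would need to be tuned on held-out data.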
