High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering

Hengjie Liu, Ruibo Hou, Yves Lepage

TL;DR

This work tackles low-resource NMT data scarcity by leveraging a source-language monolingual corpus through a GAN-based augmentation framework that avoids degrading the translation model. It integrates Translation Memory into the NMT input to expand training data and employs a GAN with a Transformer generator and a CNN–BiLSTM discriminator to utilize source-side monolingual data without interference. A novel high-quality filtering procedure uses length and perplexity ratios, grounded in Gaussian-like statistics from natural parallel data, and may include similar-domain retrieval to select suitable sentences. Empirical results on German–Upper Sorbian demonstrate that combining TM with GAN yields the largest gains (≈+2.9 BLEU over baseline), illustrating the practical potential of source-side augmentation for rare-language translation tasks. The approach offers a pathway to better data efficiency in NMT, though GAN instability and TM computational costs remain challenges for broader deployment.
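The filtering idea summarized above can be illustrated with a minimal sketch. It assumes that a synthetic pair is kept only when its target/source length ratio and perplexity ratio both fall within `k` standard deviations of the means measured on natural parallel data. The threshold `k`, the whitespace tokenization, the function names, and the precomputed perplexity values are all illustrative assumptions, not details taken from the paper.

```python
import statistics

def ratio_stats(ratios):
    """Mean and sample standard deviation of a ratio measured on natural parallel data."""
    return statistics.mean(ratios), statistics.stdev(ratios)

def within_band(value, mean, std, k):
    """True if value lies within k standard deviations of the mean."""
    return abs(value - mean) <= k * std

def filter_pairs(candidates, len_stats, ppl_stats, k=2.0):
    """Keep synthetic (source, target) pairs whose ratios look like natural data.

    candidates: iterable of (src, tgt, src_ppl, tgt_ppl); the perplexities are
    assumed to come from separately trained language models (hypothetical here).
    k is an assumed threshold, not a value reported in the paper.
    """
    kept = []
    for src, tgt, src_ppl, tgt_ppl in candidates:
        src_len = len(src.split())  # whitespace tokenization is an assumption
        tgt_len = len(tgt.split())
        if src_len == 0 or tgt_len == 0 or src_ppl <= 0:
            continue  # skip degenerate inputs to avoid division by zero
        len_ratio = tgt_len / src_len
        ppl_ratio = tgt_ppl / src_ppl
        if within_band(len_ratio, *len_stats, k) and within_band(ppl_ratio, *ppl_stats, k):
            kept.append((src, tgt))
    return kept
```

In use, `ratio_stats` would be called once per feature on the natural parallel corpus, and the resulting `(mean, std)` tuples passed to `filter_pairs` together with the synthetic candidates.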

Abstract

Back translation, as a technique for extending a dataset, is widely used by researchers in low-resource language translation tasks. It typically translates from the target to the source language to ensure high-quality translation results. This paper proposes a novel way of utilizing a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings. We realize this concept by employing a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic monolingual translations with the generator. Additionally, this paper integrates Translation Memory (TM) with NMT, increasing the amount of data available to the generator. Moreover, we propose a novel procedure to filter the synthetic sentence pairs during the augmentation process, ensuring the high quality of the data.

Paper Structure (12 sections, 2 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Process of integrating TM into NMT, where $d(s,s_t)$ denotes the Euclidean distance between the sentence vectors of the source input $s$ and a source sentence $s_t$ in the TM. The input consists of $s$, $s_t$ and $t_t$. The corresponding output is the translation $t$ of $s$.
  • Figure 2: How a source-side monolingual corpus is used to enhance the NMT task. The Generator is a vanilla Transformer model. The Discriminator is a fusion neural network model of our own design.
  • Figure 3: A process for filtering high-quality translations. We filter both the source sentences and the target sentences. We calculate features from natural language corpora (the ratios of sentence length and of perplexity) to serve as filtering criteria. This allows us to select translations that more closely resemble natural language sentences. Finally, we validate the effectiveness of the translation results using standard data augmentation experiments.
  • Figure 4:
  • Figure 5:
  • ...and 8 more figures
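
The retrieval step described in the Figure 1 caption can be sketched as follows, assuming the TM entry whose source vector is closest to the input vector under Euclidean distance is selected, and that $s$, $s_t$ and $t_t$ are then joined into one input string. The separator token `<sep>`, the function names, and the toy vectors are hypothetical; the caption does not specify how the three parts are combined or how sentence vectors are computed.

```python
import math

def retrieve(s_vec, tm):
    """Return the TM entry whose source vector is nearest to s_vec.

    tm: list of (s_t_vec, s_t, t_t) tuples, where s_t_vec is a precomputed
    sentence vector for the TM source sentence s_t (encoder is an assumption).
    """
    return min(tm, key=lambda entry: math.dist(s_vec, entry[0]))

def build_input(s, s_vec, tm, sep=" <sep> "):
    """Concatenate s with the retrieved pair (s_t, t_t).

    The separator is a hypothetical choice made for this illustration.
    """
    _, s_t, t_t = retrieve(s_vec, tm)
    return sep.join([s, s_t, t_t])

# Toy translation memory with placeholder sentences and 2-d vectors.
toy_tm = [
    ([0.0, 0.0], "source A", "target A"),
    ([1.0, 1.0], "source B", "target B"),
]
augmented = build_input("new source", [0.1, 0.0], toy_tm)
```

A linear scan like this is only practical for a small TM; a larger memory would normally need an approximate nearest-neighbor index.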