Generative-Adversarial Networks for Low-Resource Language Data Augmentation in Machine Translation

Linda Zeng

TL;DR

The paper addresses neural machine translation (NMT) for low-resource languages by proposing a data augmentation framework, based on a generative-adversarial network (GAN), that generates monolingual low-resource language data to improve training. The approach trains an encoder-decoder on parallel data, freezes it, and then uses a GAN to map random noise to latent-space embeddings that the frozen decoder transforms into augmented target-language sentences. In a simulated low-resource setup with 20,000 training pairs, the encoder-decoder achieves 69.3% test accuracy, and the GAN produces coherent sentences, albeit with repetition and grammatical inconsistencies, which indicates potential with room for improvement. The work demonstrates the feasibility of GAN-based data augmentation for low-resource NMT and outlines three directions for future work: improving sentence quality, evaluating the impact on end-to-end MT, and extending the method to parallel data.
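The generation path described above (noise, then generator, then frozen decoder) can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, the toy vocabulary, the random linear maps standing in for trained weights, and the helper names (`make_linear`, `generator`, `frozen_decoder`, `augment`) are all assumptions made for illustration, and the adversarial training of the generator is omitted entirely.

```python
import random

# All sizes and the vocabulary below are illustrative assumptions,
# not values taken from the paper.
NOISE_DIM = 3    # size of the random noise vector fed to the generator
LATENT_DIM = 4   # size of the latent embedding the decoder consumes
VOCAB = ["<eos>", "my", "grandfather", "works", "hard"]


def make_linear(n_in, n_out, rng):
    """Random weight matrix standing in for trained parameters."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def generator(noise, g_weights):
    """Map a noise vector to a latent-space embedding."""
    return apply_linear(g_weights, noise)


def frozen_decoder(latent, d_weights, max_len=6):
    """Toy greedy decoder; its weights are only read, never updated."""
    tokens = []
    state = list(latent)
    for _ in range(max_len):
        scores = apply_linear(d_weights, state)
        idx = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[idx] == "<eos>":
            break
        tokens.append(VOCAB[idx])
        # Rotate the state so later steps can pick different tokens.
        state = state[1:] + state[:1]
    return tokens


def augment(n_sentences, seed=0):
    """Sample noise, map it to latents, and decode each latent to tokens."""
    rng = random.Random(seed)
    g_weights = make_linear(NOISE_DIM, LATENT_DIM, rng)
    d_weights = make_linear(LATENT_DIM, len(VOCAB), rng)
    sentences = []
    for _ in range(n_sentences):
        noise = [rng.gauss(0, 1) for _ in range(NOISE_DIM)]
        latent = generator(noise, g_weights)
        sentences.append(frozen_decoder(latent, d_weights))
    return sentences


if __name__ == "__main__":
    for sentence in augment(3):
        print(" ".join(sentence))
```

Because the weights here are random, the output is not meaningful text; the sketch only shows how the pieces connect. In the setup the summary describes, the decoder weights would come from the encoder-decoder trained on parallel data, and only the generator would be updated during GAN training.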

Abstract

Neural Machine Translation (NMT) systems struggle when translating to and from low-resource languages, which lack large-scale data corpora for models to use for training. As manual data curation is expensive and time-consuming, we propose utilizing a generative-adversarial network (GAN) to augment low-resource language data. When training on a very small amount of language data (under 20,000 sentences) in a simulated low-resource setting, our model shows potential at data augmentation, generating monolingual language data with sentences such as "ask me that healthy lunch im cooking up," and "my grandfather work harder than your grandfather before." Our novel data augmentation approach takes the first step in investigating the capability of GANs in low-resource NMT, and our results suggest that there is promise for future extension of GANs to low-resource NMT.

Paper Structure (27 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overall Workflow
  • Figure 2: Model Architectures
  • Figure 3: Accuracy and Loss of the Encoder-Decoder during Training
  • Figure 4: Loss of the GAN