Generative-Adversarial Networks for Low-Resource Language Data Augmentation in Machine Translation

Linda Zeng

TL;DR

The paper addresses neural machine translation for low-resource languages by proposing a generative-adversarial network (GAN) based data augmentation framework that generates monolingual low-resource language data to improve training. The approach trains an encoder-decoder on parallel data, freezes it, and uses a GAN to map random noise to latent-space embeddings that the decoder can transform into augmented target-language sentences. In a simulated setup with 20,000 training pairs, the encoder-decoder achieves 69.3% test accuracy while the GAN can produce coherent sentences, albeit with repetition and grammatical inconsistencies, indicating promising potential with room for improvement. The work demonstrates the feasibility of GAN-based data augmentation for low-resource NMT and outlines concrete directions for enhancing sentence quality, evaluating impact on end-to-end MT, and extending to parallel data.
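To make the generation path described above concrete, here is a minimal sketch of the data flow only: noise goes into a generator, the generator emits a latent embedding, and a frozen decoder turns that embedding into tokens. Everything in it is an assumption made for illustration and not the paper's implementation: the toy vocabulary, the dimensions, the random linear layers standing in for trained weights, the greedy decoding loop, and the names `FrozenDecoder`, `Generator`, and `augment`. The adversarial training of the generator against a discriminator is omitted entirely.

```python
import math
import random

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
VOCAB = ["<eos>", "my", "grandfather", "works", "harder", "lunch"]
LATENT_DIM = 4  # assumed size of a latent-space embedding
NOISE_DIM = 3   # assumed size of the random noise vector


def make_matrix(rows, cols, rng):
    """Random weights standing in for parameters that would be learned."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class FrozenDecoder:
    """Stand-in for the decoder half of an already trained encoder-decoder.

    Its weights are set once and never updated, mirroring the freeze step.
    """

    def __init__(self, rng):
        self.out = make_matrix(len(VOCAB), LATENT_DIM, rng)
        self.step = make_matrix(LATENT_DIM, LATENT_DIM, rng)

    def decode(self, latent, max_len=8):
        tokens = []
        state = list(latent)
        for _ in range(max_len):
            scores = matvec(self.out, state)
            token = VOCAB[scores.index(max(scores))]  # greedy token choice
            if token == "<eos>":
                break
            tokens.append(token)
            # Advance the hidden state before predicting the next token.
            state = [math.tanh(v) for v in matvec(self.step, state)]
        return tokens


class Generator:
    """Maps a noise vector to a latent embedding (untrained in this sketch)."""

    def __init__(self, rng):
        self.weights = make_matrix(LATENT_DIM, NOISE_DIM, rng)

    def forward(self, noise):
        return [math.tanh(v) for v in matvec(self.weights, noise)]


def augment(generator, decoder, n_sentences, rng):
    """Sample noise, map it to latent space, and decode each latent vector."""
    sentences = []
    for _ in range(n_sentences):
        noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
        latent = generator.forward(noise)
        sentences.append(decoder.decode(latent))
    return sentences
```

Calling `augment(Generator(rng), FrozenDecoder(rng), 5, rng)` with `rng = random.Random(0)` returns five token lists. Because the weights here are random, the output is not meaningful text; the sketch only shows where a trained generator and a frozen decoder would plug in.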

Abstract

Neural Machine Translation (NMT) systems struggle when translating to and from low-resource languages, which lack large-scale data corpora for models to use for training. As manual data curation is expensive and time-consuming, we propose utilizing a generative-adversarial network (GAN) to augment low-resource language data. When training on a very small amount of language data (under 20,000 sentences) in a simulated low-resource setting, our model shows potential at data augmentation, generating monolingual language data with sentences such as "ask me that healthy lunch im cooking up," and "my grandfather work harder than your grandfather before." Our novel data augmentation approach takes the first step in investigating the capability of GANs in low-resource NMT, and our results suggest that there is promise for future extension of GANs to low-resource NMT.

Paper Structure

This paper contains 27 sections, 4 figures, and 4 tables.

Introduction
Related Work
Preliminaries on NMT
Data Augmentation for Low-Resource NMT
Preliminaries on GANs
GANs in NLP
Model Architecture
Overall Workflow
Underlying Architectures
Data
Simulated Low-Resource Setting
Training Data
Test Data
Preprocessing
Results
...and 12 more sections

Figures (4)

Figure 1: Overall Workflow
Figure 2: Model Architectures
Figure 3: Accuracy and Loss of the Encoder-Decoder during Training
Figure 4: Loss of the GAN
