Deterministic Reversible Data Augmentation for Neural Machine Translation

Jiashu Yao; Heyan Huang; Zeming Liu; Yuhang Guo

TL;DR

Deterministic Reversible Data Augmentation (DRDA) tackles semantic inconsistency in data augmentation for neural machine translation by using deterministic, reversible multi-granularity subword segmentations and a multi-view training objective that aligns predictions across granularities. The method constructs multiple vocabularies (prime and augmented) and trains on a combined loss with three terms: the NLL on the prime-segmented source, the NLL on the augmented-segmented source, and an agreement term that pulls the resulting predictive distributions together. The augmented data is therefore symbolically diverse yet semantically consistent. Across IWSLT, WMT, and TED tasks, DRDA consistently improves BLEU over strong Transformer baselines, with notable gains in low-resource and noisy settings, and analyses indicate improved semantic consistency and subword composition. Because DRDA requires no extra data or model changes and may carry over to other segmentation-based tasks, it is a practical option for robust NMT and related NLP problems, especially where rare or morphologically rich subwords are prevalent.
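To make the three-term objective concrete, here is a minimal sketch for a single target position. The summary above does not specify the form of the agreement term, so the symmetric KL divergence used here is an assumption, as are the function names and the `agreement_weight` parameter; the distributions are assumed to be strictly positive.

```python
import math


def nll(probs, target_index):
    """Negative log-likelihood of the reference token under one distribution."""
    return -math.log(probs[target_index])


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multi_view_loss(prime_probs, aug_probs, target_index, agreement_weight=1.0):
    """Sum of both NLL terms plus a weighted agreement term.

    The agreement term is a symmetric KL divergence, which is an assumed
    choice for illustration, not necessarily the paper's exact formulation.
    """
    agreement = 0.5 * (kl(prime_probs, aug_probs) + kl(aug_probs, prime_probs))
    return (
        nll(prime_probs, target_index)
        + nll(aug_probs, target_index)
        + agreement_weight * agreement
    )


# Hypothetical output distributions over a 3-token target vocabulary.
prime = [0.7, 0.2, 0.1]  # prediction from the prime segmentation
aug = [0.5, 0.3, 0.2]    # prediction from an augmented segmentation
loss = multi_view_loss(prime, aug, target_index=0)
```

When the two views already agree, the agreement term vanishes and the loss reduces to the sum of the two NLL terms, which is the sense in which the third term only penalizes disagreement between granularities.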

Abstract

Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate augmentation data that is both symbolically diverse and semantically consistent, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks by a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness on noisy, low-resource, and cross-domain datasets.
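The idea of deterministic, reversible multi-granularity segmentation can be illustrated with a toy example. This sketch uses greedy longest-match segmentation over two hand-written vocabularies rather than the subword algorithm used in the paper; the `segment` helper and both vocabularies are hypothetical and serve only to show that different granularities yield different symbol sequences for the same underlying string.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation, falling back to single characters.

    This is a stand-in for a real subword tokenizer: it is deterministic
    (no sampling), and joining the pieces always restores the input.
    """
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first; a single character always matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


# Hypothetical vocabularies: the larger one additionally contains the merged unit.
fine_vocab = {"_nerv", "ous"}
coarse_vocab = {"_nerv", "ous", "_nervous"}

fine = segment("_nervous", fine_vocab)      # ['_nerv', 'ous']
coarse = segment("_nervous", coarse_vocab)  # ['_nervous']

# Reversibility: both views concatenate back to the same surface string.
restored_fine = "".join(fine)
restored_coarse = "".join(coarse)
```

Both sequences carry the same content, so neither view drops or substitutes words; they differ only in how the string is split.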

Paper Structure (42 sections, 13 equations, 6 figures, 13 tables)

Introduction
Related Work
Augmentation methods
Subword regularization
Background: Subword Segmentation
Deterministic Reversible Data Augmentation
Multi-Granularity Segmentations
Multi-view Learning
Dynamic Selection of Granularity in Inference
Experiments
Experimental Setup
Datasets and preprocessing
Models
Hyperparameters in training and inferring
Evaluation
...and 27 more sections

Figures (6)

Figure 1: Subword piece sequences generated by previous data augmentation (A), subword regularization (B), and multi-granularity segmentation (C), all representing the same source sentence. $\Box$ denotes an empty subword (a zero vector). Previous data augmentation methods result in semantic loss (red text), subword regularization may sample inappropriate subwords (yellow text), while multi-granularity segmentation generates symbolically diverse and semantically consistent augmentation data (green text).
Figure 2: Illustration of the overall framework of DRDA. A source sentence is segmented at different granularities, and each resulting token sequence is passed through the model to obtain its own hypothesis distribution. The agreement loss (blue segmented lines) is computed between hypothesis distributions, and the negative log-likelihood loss (green dotted lines) is computed between each distribution and the target.
Figure 3: Ablations on IWSLT De$\rightarrow$En over augmented vocabulary size (left) and agreement loss weight (right).
Figure 4: Most occurrences of "_nerv" are absorbed by "_nervous" when the vocabulary grows (left). The frequency drop rate of "_nerv" is $(121-6)/121 \approx 0.95$. The right figure shows all frequency drop rates on IWSLT En$\rightarrow$De sorted in descending order.
Figure 5: The similarity between the fine- and coarse-grained representations is computed by $\cos \theta$.
...and 1 more figure
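The frequency drop rate quoted in the Figure 4 caption can be reproduced with a short calculation. The function name is hypothetical; the counts 121 and 6 are the ones given in the caption.

```python
def frequency_drop_rate(count_before, count_after):
    """Fraction of a subword's occurrences lost when the vocabulary grows."""
    return (count_before - count_after) / count_before


# "_nerv" occurs 121 times before and 6 times after the vocabulary grows.
rate = frequency_drop_rate(121, 6)
print(round(rate, 2))  # 0.95
```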
