Improving Non-autoregressive Machine Translation with Error Exposure and Consistency Regularization

Xinran Chen, Sufeng Duan, Gongshen Liu

TL;DR

This paper constructs mixed sequences from model predictions during training, optimizes over the masked tokens under imperfect observation conditions, and designs a consistency learning method that constrains the data distribution under different observing situations to narrow the gap between training and inference.

Abstract

As one of the IR-NAT (iterative-refinement-based NAT) frameworks, the Conditional Masked Language Model (CMLM) adopts the mask-predict paradigm to re-predict masked low-confidence tokens. However, CMLM suffers from a data distribution discrepancy between training and inference, because the observed tokens are generated differently in the two cases. In this paper, we address this problem with two training approaches, error exposure and consistency regularization (EECR). We construct mixed sequences based on model predictions during training and propose to optimize over the masked tokens under imperfect observation conditions. We also design a consistency learning method that constrains the data distribution of the masked tokens under different observing situations, narrowing the gap between training and inference. Experiments on five translation benchmarks obtain average improvements of 0.68 and 0.40 BLEU over the base models, respectively, and our CMLMC-EECR achieves the best performance, with translation quality comparable to the Transformer. The experimental results demonstrate the effectiveness of our method.
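
To make the mask-predict paradigm mentioned above concrete, the following is a minimal sketch of one re-masking step: the least-confident predicted tokens are replaced by a mask symbol so the decoder can re-predict them. The function names, the `[M]` string, and the linear decay schedule are illustrative assumptions, not details taken from this paper.

```python
MASK = "[M]"  # placeholder mask symbol (assumed for illustration)


def remask_low_confidence(tokens, confidences, num_to_mask):
    """Return a copy of `tokens` with the `num_to_mask` least-confident positions masked."""
    # Order positions by ascending confidence; sorted() is stable, so ties keep left-to-right order.
    order = sorted(range(len(tokens)), key=lambda i: confidences[i])
    to_mask = set(order[:num_to_mask])
    return [MASK if i in to_mask else tok for i, tok in enumerate(tokens)]


def num_masks_linear(length, step, total_steps):
    """Hypothetical linear decay schedule: later refinement steps mask fewer tokens."""
    return int(length * (total_steps - step) / total_steps)


if __name__ == "__main__":
    tokens = ["the", "cat", "sat", "on", "mat"]
    confidences = [0.9, 0.4, 0.8, 0.3, 0.95]
    # The two lowest-confidence tokens ("on" and "cat") are masked for re-prediction.
    print(remask_low_confidence(tokens, confidences, 2))
```

At inference the observed (unmasked) tokens are the model's own earlier predictions, whereas in standard CMLM training they come from the ground truth; this is the discrepancy the abstract refers to.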

Paper Structure

This paper contains 40 sections, 11 equations, 4 figures, 13 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of our EECR strategy. The left part illustrates the sequence prediction process used for mixed sequence generation. The decoder refines the predicted sequence $\hat{Y}$ $k$ times, each time conditioning on the sequence from the previous step, $\hat{Y}_{Prev}$ (blue dotted arrows). Subsequently, tokens in the partially masked ground-truth sequences are randomly replaced with the predicted tokens $\hat{y}_1$ and $\hat{y}_5$ (dashed arrows), yielding the mixed sequences $Y^{1}$ and $Y^{2}$. The right part depicts the consistency learning process. The probability distributions of the masked tokens $\texttt{[M]}$ under the ground-truth and mixed sequences are constrained by the consistency regularization (bidirectional arrows).
  • Figure 2: The cosine similarity of masked token representations under different observing scenarios of CMLM-EECR and CMLM.
  • Figure 3: Training curves of CMLM and CMLM-EECR on the IWSLT14 DE$\rightarrow$EN validation set. The number of inference iterations is set to 1.
  • Figure 4: Translation quality on the WMT16 EN$\rightarrow$RO test set across sentence groups of different lengths.
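
The two steps described in the Figure 1 caption can be sketched roughly as follows: build a mixed sequence by replacing some observed ground-truth tokens with model predictions, then penalize disagreement between the masked-token distributions obtained under the ground-truth and mixed observations. The replacement probability, the helper names, and the use of a symmetric KL divergence are assumptions made for illustration; the source does not specify the exact form of the regularizer.

```python
import math
import random

MASK = "[M]"  # placeholder mask symbol (assumed for illustration)


def build_mixed_sequence(reference, predicted, masked_positions, replace_prob, rng):
    """Mask `masked_positions`, then randomly swap observed reference tokens for predictions."""
    mixed = []
    for i, ref_tok in enumerate(reference):
        if i in masked_positions:
            mixed.append(MASK)            # position the model must predict
        elif rng.random() < replace_prob:
            mixed.append(predicted[i])    # expose the model to its own (possibly wrong) token
        else:
            mixed.append(ref_tok)         # keep the ground-truth token
    return mixed


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Assumed consistency term: average of the two KL directions."""
    return 0.5 * (kl(p, q) + kl(q, p))


if __name__ == "__main__":
    rng = random.Random(0)
    reference = ["we", "like", "green", "tea", "."]
    predicted = ["we", "love", "green", "tea", "!"]
    print(build_mixed_sequence(reference, predicted, {2, 3}, 0.5, rng))

    # Toy distributions for one masked position under the two observation conditions.
    p_clean = [0.7, 0.2, 0.1]
    p_mixed = [0.5, 0.3, 0.2]
    print(symmetric_kl(p_clean, p_mixed))
```

In a training loop, a term like `symmetric_kl` would be averaged over the masked positions and added to the usual masked-token cross-entropy with some weight; that weight is a further hypothetical hyperparameter not given here.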