ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation
Lifu Tu, Richard Yuanzhe Pang, Sam Wiseman, Kevin Gimpel
TL;DR
ENGINE trains a non-autoregressive machine translation (NAT) model as an inference network that minimizes the energy $E_\Theta(\boldsymbol{x},\boldsymbol{y})$ defined by a pretrained autoregressive model. Because a discrete output sequence is not differentiable, the target is represented as a sequence of distributions over words, and two operators, $\mathbf{O_1}$ and $\mathbf{O_2}$, apply differentiable relaxations so that gradients can flow from the energy back into the inference network. The trained network then approximates $\arg\min_{\boldsymbol{y}} E_\Theta(\boldsymbol{x},\boldsymbol{y})$ with fast, non-autoregressive decoding. On IWSLT14 DE-EN and WMT16 RO-EN, ENGINE reports state-of-the-art NAT results, with BLEU scores of $31.99$ and $33.16$ respectively at 1 iteration and further gains with refinement iterations, approaching the autoregressive teachers. The results indicate that energy-based training can outperform distillation-based NAT training and may extend to other energy-based generation tasks.
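To make the role of the two operators concrete, here is a minimal NumPy sketch of a relaxed autoregressive energy. It is an illustration under loud assumptions, not the paper's implementation: the teacher is a toy bigram matrix `teacher_W` rather than a Transformer, the source sentence $\boldsymbol{x}$ is omitted, and the names `op_softmax`, `op_argmax_onehot`, and `relaxed_energy` are hypothetical. The sketch also assumes one reading of the operators, in which $\mathbf{O_1}$ shapes what the teacher conditions on and $\mathbf{O_2}$ shapes what the teacher scores. Only the forward pass is shown; gradient-based training would need an autodiff framework.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def op_softmax(logits):
    """Soft relaxation: every position becomes a distribution over the vocabulary."""
    return np.exp(log_softmax(logits))

def op_argmax_onehot(logits):
    """Hard choice: one-hot vector at the argmax of each position (forward pass only)."""
    onehot = np.zeros_like(logits)
    onehot[np.arange(logits.shape[0]), logits.argmax(axis=-1)] = 1.0
    return onehot

def relaxed_energy(logits, teacher_W, bos, o1, o2):
    """Toy energy: sum over t of -O2(z_t)^T log p(. | O1(z_{t-1})).

    logits:    (T, V) inference-network outputs, one row per target position.
    teacher_W: (V, V) hypothetical bigram teacher; a previous-token distribution
               times this matrix gives next-token logits.
    bos:       (V,) one-hot start-of-sentence vector.
    """
    fed = o1(logits)     # what the teacher conditions on
    scored = o2(logits)  # what the teacher assigns log-probability to
    prev, energy = bos, 0.0
    for t in range(logits.shape[0]):
        log_p = log_softmax(prev @ teacher_W)  # teacher's next-token log-probs
        energy -= scored[t] @ log_p            # expected negative log-likelihood
        prev = fed[t]
    return float(energy)

# Usage with random toy values (assumed sizes: vocabulary 5, length 4).
rng = np.random.default_rng(0)
V, T = 5, 4
teacher_W = rng.normal(size=(V, V))
logits = rng.normal(size=(T, V))
bos = np.eye(V)[0]

e_soft = relaxed_energy(logits, teacher_W, bos, op_softmax, op_softmax)
e_hard = relaxed_energy(logits, teacher_W, bos, op_argmax_onehot, op_argmax_onehot)
print(e_soft, e_hard)
```

With the hard operator in both roles, the energy reduces to the teacher's negative log-likelihood of the argmax sequence, which is the quantity the discrete decoder ultimately cares about. The soft operator instead gives an expectation under the predicted distributions, which is smooth in the logits.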
Abstract
We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models.
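For concreteness, the training objective described above can be sketched as follows. The notation is assumed for illustration: $\mathbf{A}_\Psi$ denotes the inference network (the non-autoregressive model) with parameters $\Psi$, $\mathcal{D}$ denotes the set of training source sentences, and the energy is taken to be the negative log-likelihood under the pretrained autoregressive teacher with parameters $\Theta$.

$$
E_\Theta(\boldsymbol{x},\boldsymbol{y}) = -\log p_\Theta(\boldsymbol{y} \mid \boldsymbol{x}) = -\sum_{t=1}^{T} \log p_\Theta(y_t \mid y_{<t}, \boldsymbol{x}),
\qquad
\hat{\Psi} = \arg\min_{\Psi} \sum_{\boldsymbol{x} \in \mathcal{D}} E_\Theta\big(\boldsymbol{x}, \mathbf{A}_\Psi(\boldsymbol{x})\big)
$$

Under this reading, distillation fits the non-autoregressive model to a single beam-searched output per source sentence, whereas the objective above scores the model's own output directly with the teacher.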
