ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation

Lifu Tu, Richard Yuanzhe Pang, Sam Wiseman, Kevin Gimpel

TL;DR

ENGINE is an energy-based framework for non-autoregressive machine translation (NAT): it trains an inference network to minimize the energy $E_\Theta(\boldsymbol{x},\boldsymbol{y})$ defined by a pretrained autoregressive model, which allows end-to-end gradient-based optimization over distributions of target tokens. The target is reformulated as a sequence of word distributions, and two operators, $\mathbf{O_1}$ and $\mathbf{O_2}$, supply differentiable relaxations so that the network can approximate $\arg\min_{\boldsymbol{y}} E_\Theta(\boldsymbol{x},\boldsymbol{y})$ with fast, non-autoregressive decoding. On IWSLT14 DE-EN and WMT16 RO-EN, ENGINE achieves state-of-the-art NAT performance, with BLEU scores of $31.99$ and $33.16$ respectively at 1 decoding iteration and further gains with additional refinement iterations, approaching the autoregressive teachers. The results indicate that energy-based training can outperform distillation-based NAT training and may extend to other energy-based generation tasks.

Abstract

We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models.
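
To make the idea of an energy over word distributions concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy teacher (a 3-word bigram table named `BIGRAM`) in place of the pretrained autoregressive translation model, and it assumes the teacher is conditioned on a soft previous word by mixing bigram rows. The names `relaxed_energy`, `one_hot`, `BIGRAM`, and `START` are illustrative only.

```python
import math

# Hypothetical toy "teacher": a bigram model over a 3-word vocabulary.
# BIGRAM[u][v] = p(next word = v | previous word = u).
# This stands in for the pretrained autoregressive model (an assumption).
BIGRAM = [
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.7, 0.2, 0.1],
]
START = [1.0, 0.0, 0.0]  # one-hot start symbol (word 0 doubles as start)


def relaxed_energy(dists):
    """Energy of a sequence of word distributions under the toy teacher.

    Step t contributes -sum_v y_t[v] * log p(v | previous distribution).
    The teacher is conditioned on the previous *distribution* by mixing
    the bigram rows with that distribution's weights (an assumption made
    for this toy model).
    """
    energy = 0.0
    prev = START
    for y_t in dists:
        # Teacher's next-word distribution given the soft previous word.
        p_next = [
            sum(prev[u] * BIGRAM[u][v] for u in range(len(prev)))
            for v in range(len(y_t))
        ]
        # Expected negative log-probability under the distribution y_t.
        energy -= sum(
            y_t[v] * math.log(p_next[v]) for v in range(len(y_t)) if y_t[v] > 0
        )
        prev = y_t
    return energy


def one_hot(index, size=3):
    """Return a one-hot distribution over `size` words."""
    return [1.0 if i == index else 0.0 for i in range(size)]


hard = [one_hot(1), one_hot(2)]             # a discrete two-word output
soft = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]   # a relaxed version of it

print(relaxed_energy(hard))
print(relaxed_energy(soft))
```

For one-hot inputs the relaxed energy reduces to the usual negative log-likelihood of the discrete sequence (here, $-2\log 0.6$). Because the energy is a smooth function of the distributions, an inference network that outputs such distributions could in principle be trained by gradient descent on it, which is the setting the abstract describes.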

Paper Structure

This paper contains 20 sections, 5 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: The ENGINE framework trains a non-autoregressive inference network $\mathbf{A}_{\Psi}$ to produce translations with low energy under a pretrained autoregressive energy $E$.