FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow

Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, Eduard Hovy

TL;DR

FlowSeq addresses the bottleneck of autoregressive decoding by introducing a non-autoregressive seq2seq model built on generative flows to model a complex latent prior over output tokens. It combines a source encoder, a variational posterior for latent variables, and a Glow-inspired flow prior with an entirely parallel decoder, enabling near-constant decoding time w.r.t. sequence length. Training relies on variational inference with ELBO, while decoding utilizes Argmax, Noisy Parallel Decoding, or Importance Weighted Decoding to approximate the intractable marginalization over latent variables. Empirically, FlowSeq achieves competitive BLEU on multiple MT benchmarks and offers substantial speedups in decoding, while analysis highlights the impact of sampling strategies and translation diversity, marking a practical step toward efficient, high-quality non-autoregressive generation.
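
The training objective named above can be written out explicitly. The following is a sketch in standard conditional latent-variable notation, not a transcription of the paper's equations: the parameter symbols $\theta$ and $\phi$, the flow $f_\theta$, and the base variable $\boldsymbol{\upsilon}$ are assumptions introduced here. With source $\mathbf{x}$, target $\mathbf{y}$, and latent variables $\mathbf{z}$, the ELBO reads:

$$
\log p_\theta(\mathbf{y} \mid \mathbf{x}) \;\ge\; \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{y}, \mathbf{x})}\big[\log p_\theta(\mathbf{y} \mid \mathbf{z}, \mathbf{x})\big] \;-\; \mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{y}, \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\big)
$$

If the prior is an invertible flow $\boldsymbol{\upsilon} = f_\theta(\mathbf{z}; \mathbf{x})$ over a simple base density $p(\boldsymbol{\upsilon})$ (assumed here to be a standard Gaussian), its log-density follows from the change-of-variables formula:

$$
\log p_\theta(\mathbf{z} \mid \mathbf{x}) \;=\; \log p(\boldsymbol{\upsilon}) \;+\; \log \left| \det \frac{\partial f_\theta(\mathbf{z}; \mathbf{x})}{\partial \mathbf{z}} \right|
$$

The decoding strategies listed above exist because $p_\theta(\mathbf{y} \mid \mathbf{x}) = \int p_\theta(\mathbf{y} \mid \mathbf{z}, \mathbf{x})\, p_\theta(\mathbf{z} \mid \mathbf{x})\, d\mathbf{z}$ has no closed form, so the best output has to be approximated from one or more sampled values of $\mathbf{z}$.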

Abstract

Most sequence-to-sequence (seq2seq) models are autoregressive; they generate each token by conditioning on previously generated tokens. In contrast, non-autoregressive seq2seq models generate all tokens in one pass, which leads to increased efficiency through parallel processing on hardware such as GPUs. However, directly modeling the joint distribution of all tokens simultaneously is challenging, and even with increasingly complex model structures, accuracy lags significantly behind autoregressive models. In this paper, we propose a simple, efficient, and effective model for non-autoregressive sequence generation using latent variable models. Specifically, we turn to generative flow, an elegant technique to model complex distributions using neural networks, and design several layers of flow tailored for modeling the conditional density of sequential latent variables. We evaluate this model on three neural machine translation (NMT) benchmark datasets, achieving comparable performance with state-of-the-art non-autoregressive NMT models and almost constant decoding time w.r.t. the sequence length.

Paper Structure

This paper contains 43 sections, 22 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) Autoregressive (b) non-autoregressive and (c) our proposed sequence generation models. $\mathbf{x}$ is the source, $\mathbf{y}$ is the target, and $\mathbf{z}$ are latent variables.
  • Figure 2: Neural architecture of FlowSeq, including the encoder, the decoder and the posterior networks, together with the multi-scale architecture of the prior flow. The architecture of each flow step is in Figure 3.
  • Figure 3: (a) The architecture of one step of our flow. (b) The visualization of three split patterns for coupling layers, where the red color denotes $\mathbf{z}_a$ and the blue color denotes $\mathbf{z}_b$. (c) The attention-based architecture of the NN function in coupling layers.
  • Figure 4: The decoding speed of the Transformer (batched, beam size 5) and FlowSeq on WMT14 EN-DE test set (a) w.r.t. different batch sizes (b) bucketed by different target sentence lengths (batch size 32).
  • Figure 5: Impact of sampling hyperparameters on the rescoring BLEU on the dev set of WMT14 DE-EN. Experiments are performed with FlowSeq-base trained with distillation data. $l$ is the number of length candidates. $r$ is the number of samples for each length.
  • ...and 2 more figures
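
The Figure 3 caption refers to coupling layers that split the latent into $\mathbf{z}_a$ and $\mathbf{z}_b$. As a minimal sketch of how a generic Glow-style affine coupling layer works, the snippet below leaves $\mathbf{z}_a$ unchanged and rescales and shifts $\mathbf{z}_b$ using values computed from $\mathbf{z}_a$. The function names and `toy_scale_shift` are hypothetical: the caption says the paper's NN function is attention-based, whereas this stand-in is a fixed elementwise map chosen only to keep the example self-contained.

```python
import math


def toy_scale_shift(z_a):
    """Hypothetical stand-in for the NN function of a coupling layer.

    Returns a log-scale and a shift for every element of z_b,
    computed only from z_a.
    """
    log_s = [math.tanh(0.5 * v) for v in z_a]
    shift = [0.1 * v + 1.0 for v in z_a]
    return log_s, shift


def coupling_forward(z):
    """Map z = [z_a, z_b] to [z_a, z_b * exp(log_s) + shift].

    Assumes len(z) is even. Also returns the log-determinant of the
    Jacobian, which for an affine coupling is the sum of the log-scales.
    """
    d = len(z) // 2
    z_a, z_b = z[:d], z[d:]
    log_s, shift = toy_scale_shift(z_a)
    y_b = [zb * math.exp(s) + t for zb, s, t in zip(z_b, log_s, shift)]
    return z_a + y_b, sum(log_s)


def coupling_inverse(y):
    """Invert coupling_forward exactly, without inverting the NN itself."""
    d = len(y) // 2
    y_a, y_b = y[:d], y[d:]
    # y_a equals z_a, so the same log-scale and shift can be recomputed.
    log_s, shift = toy_scale_shift(y_a)
    z_b = [(yb - t) * math.exp(-s) for yb, s, t in zip(y_b, log_s, shift)]
    return y_a + z_b, -sum(log_s)


z = [0.3, -1.2, 2.0, 0.7]
y, log_det = coupling_forward(z)
z_rec, inv_log_det = coupling_inverse(y)
```

Because half of the vector passes through untouched, the inverse and the log-determinant cost about as much as the forward pass. A single layer never changes $\mathbf{z}_a$, so stacked layers need to vary which part plays that role, which is presumably what the different split patterns in the caption provide.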