Imputer: Sequence Modelling via Imputation and Dynamic Programming

William Chan, Chitwan Saharia, Geoffrey Hinton, Mohammad Norouzi, Navdeep Jaitly

TL;DR

The paper tackles efficient sequence modeling for long outputs by introducing the Imputer, an iterative imputation model that generates an alignment in a fixed number of steps. It combines CTC-style marginalization over alignments with a dynamic programming training objective, using roll-in and masking distributions to train a model that can interpolate between fully autoregressive and fully non-autoregressive generation. The DP-based training yields a tighter lower bound on the log-likelihood than imitation learning alone. On LibriSpeech test-other the Imputer reaches 11.1 WER, ahead of CTC (13.0 WER) and seq2seq (12.5 WER), and it shows strong improvements over CTC on WSJ. The approach is promising for fast, context-rich sequence modeling in speech recognition and other monotonic alignment problems, offering a practical trade-off between speed and accuracy.

Abstract

This paper presents the Imputer, a neural sequence model that generates output sequences iteratively via imputations. The Imputer is an iterative generative model, requiring only a constant number of generation steps independent of the number of input or output tokens. The Imputer can be trained to approximately marginalize over all possible alignments between the input and output sequences, and all possible generation orders. We present a tractable dynamic programming training algorithm, which yields a lower bound on the log marginal likelihood. When applied to end-to-end speech recognition, the Imputer outperforms prior non-autoregressive models and achieves competitive results to autoregressive models. On LibriSpeech test-other, the Imputer achieves 11.1 WER, outperforming CTC at 13.0 WER and seq2seq at 12.5 WER.
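To make the notion of an alignment concrete, the following is a minimal sketch of how an alignment maps to an output sequence. It assumes the standard CTC convention (merge consecutive repeated tokens, then drop the blank symbol `_`); the function name `collapse` and the `BLANK` constant are hypothetical names chosen for illustration, not identifiers from the paper.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, matching the "_" used in the figure captions


def collapse(alignment):
    """Map an alignment to its output sequence with the CTC rule.

    Step 1: merge runs of identical consecutive tokens.
    Step 2: remove blank tokens.
    """
    merged = [token for token, _ in groupby(alignment)]
    return [token for token in merged if token != BLANK]


# The example alignment from the Figure 1 caption.
alignment = "A B _ _ _ C D E _ _ _ F".split()
print(collapse(alignment))
```

Many different alignments collapse to the same output sequence, which is why training marginalizes over them rather than committing to a single one.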

Paper Structure

This paper contains 19 sections, 11 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Visualization of the Imputer's decoding procedure. For this example, the alignment comprises $4$ blocks with block size of $B=3$ tokens each. The alignment "A B _ _ _ C D E _ _ _ F" is being imputed, with 1 token per block imputed at each decoding iteration. The decoding process takes exactly $B$ iterations.
  • Figure 2: Visualization of the Imputer dynamic programming training procedure. A roll-in policy is used to sample a masked-out alignment $\tilde{a}$. Imputer marginalizes over all alignments compatible with $\tilde{a}$ over the masked-out regions.
  • Figure 3: Visualization of the Imputer architecture. Imputer conditions on the previous alignment $\tilde{a}$, a convolutional network processes the audio features $x$, and a Transformer self-attention stack consolidates all available context to generate a new alignment.
  • Figure 4: Example Imputer inference from the LibriSpeech dev set with block size $B=8$ for the target sequence "[HAVE [TO [LIVE [WITH [MYSELF" with generated alignment "_ _ _ [HAVE _ _ [TO _ _ [L _ _ IVE _ [WITH _ _ [M _ Y _ _ [SE _ _ _ L _ F _ _ _ ". Imputer takes exactly $B=8$ generation steps (one per position in a block) to generate the entire sequence, independent of the output sequence length. At each iteration $b$, one token is filled in within each block, with generation running in parallel across blocks.
  • Figure 5: LibriSpeech dev-other WER with different block size used for training/inference and decoding strategy.
  • ...and 1 more figure
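
The block-wise decoding described in the Figure 1 and Figure 4 captions can be sketched as a small simulation. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_predict` is a hypothetical stand-in for the trained network that copies tokens from a known reference alignment and assigns random confidences, `MASK` is a hypothetical placeholder symbol, and picking the highest-confidence masked position in each block is one plausible selection rule chosen here for illustration.

```python
import random

MASK = "?"  # hypothetical placeholder for a not-yet-imputed position


def toy_predict(partial, reference, rng):
    """Hypothetical stand-in for the model.

    For every masked position, return (token, confidence). The token is
    copied from a reference alignment and the confidence is random.
    """
    return {
        i: (reference[i], rng.random())
        for i, token in enumerate(partial)
        if token == MASK
    }


def block_decode(reference, block_size, seed=0):
    """Fill one masked position per block per iteration, all blocks in parallel."""
    rng = random.Random(seed)
    n = len(reference)
    partial = [MASK] * n
    steps = 0
    while MASK in partial:
        preds = toy_predict(partial, reference, rng)
        for start in range(0, n, block_size):
            block = range(start, min(start + block_size, n))
            candidates = [i for i in block if i in preds]
            if candidates:
                # Assumed rule: commit the most confident masked position.
                best = max(candidates, key=lambda i: preds[i][1])
                partial[best] = preds[best][0]
        steps += 1
    return partial, steps


reference = "A B _ _ _ C D E _ _ _ F".split()  # alignment from the Figure 1 caption
decoded, steps = block_decode(reference, block_size=3)
print(" ".join(decoded), "| iterations:", steps)
```

With 12 positions and a block size of 3 there are 4 blocks, and since each iteration fills exactly one position per block, the loop ends after 3 iterations regardless of how long the alignment is.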