Latent Sequence Decompositions

William Chan, Yu Zhang, Quoc Le, Navdeep Jaitly

TL;DR

The paper introduces Latent Sequence Decompositions (LSD), a framework that learns a distribution over variable-length output tokenizations (word pieces) that depend on both the input and the target sequence, addressing limitations of fixed decompositions in end-to-end seq2seq models. It defines a probabilistic model over latent decompositions z with a collapsing function to recover y, and trains via sampling valid extensions with an ε-greedy strategy coupled with an approximate gradient estimator. Decoding uses left-to-right beam search to find the best latent path and collapses it to the final output, avoiding explicit marginalization over all decompositions. On WSJ ASR, LSD substantially improves WER over a character baseline (12.9% vs 14.8%), and further improves to 9.6% when combined with a CNN encoder, demonstrating effective learning of input-dependent, variable-length tokenizations without external language models.
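The marginalization that decoding avoids can be written out explicitly. The notation below is a sketch assumed for illustration, not quoted from the paper: $x$ is the input, $y$ the target sequence, $z$ a latent sequence of word pieces, and $\mathrm{collapse}(z)$ the concatenation of the pieces in $z$.

```latex
\log p(y \mid x) = \log \sum_{z} p(y \mid z)\, p(z \mid x),
\qquad
p(y \mid z) = \mathbb{1}\left[\mathrm{collapse}(z) = y\right]
```

Under this reading, only decompositions that collapse to $y$ contribute to the sum. Beam search replaces the sum with a search for a single high-probability $z$, which is then collapsed to produce the output.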

Abstract

We present the Latent Sequence Decompositions (LSD) framework. LSD decomposes sequences with variable-length output units as a function of both the input sequence and the output sequence. We present a training algorithm which samples valid extensions and an approximate decoding algorithm. We experiment with the Wall Street Journal speech recognition task. Our LSD model achieves 12.9% WER compared to a character baseline of 14.8% WER. When combined with a convolutional network on the encoder, we achieve 9.6% WER.
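To make "sampling valid extensions" concrete, here is a minimal Python sketch of how one decomposition of a target string might be drawn with an ε-greedy rule. It is an illustration under stated assumptions, not the authors' implementation: the names `collapse`, `valid_extensions`, `sample_decomposition`, and the `score` callback (a stand-in for the model's next-piece probabilities) are hypothetical, and the vocabulary is assumed to contain every single character so that a valid extension always exists.

```python
import random


def collapse(z):
    """Concatenate a sequence of word pieces back into the output string."""
    return "".join(z)


def valid_extensions(y, pos, vocab):
    """Return the word pieces in vocab that match y starting at offset pos."""
    return [w for w in vocab if y.startswith(w, pos)]


def sample_decomposition(y, vocab, score, epsilon=0.1, rng=None):
    """Sample one decomposition z of y, left to right.

    score(z_prefix, piece) is a hypothetical stand-in for the model's
    (unnormalized, positive) probability of emitting `piece` next.
    With probability epsilon the next piece is drawn uniformly from the
    valid extensions; otherwise it is drawn in proportion to score.
    """
    rng = rng or random.Random()
    z, pos = [], 0
    while pos < len(y):
        candidates = valid_extensions(y, pos, vocab)
        if not candidates:
            raise ValueError(f"no word piece in vocab matches y at offset {pos}")
        if rng.random() < epsilon:
            piece = rng.choice(candidates)  # exploration: uniform over valid pieces
        else:
            weights = [score(z, w) for w in candidates]
            piece = rng.choices(candidates, weights=weights, k=1)[0]
        z.append(piece)
        pos += len(piece)
    return z


if __name__ == "__main__":
    vocab = ["c", "a", "t", "ca", "at", "cat"]
    uniform_score = lambda prefix, piece: 1.0  # placeholder for a trained model
    z = sample_decomposition("cat", vocab, uniform_score, rng=random.Random(0))
    print(z, "->", collapse(z))
```

Because every sampled piece must match the target at the current offset, any sampled `z` collapses back to `y` by construction. That is the property that lets training restrict attention to decompositions consistent with the target.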

Paper Structure

This paper contains 8 sections, 8 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Distribution of the characters covered by the n-grams of the word piece models. We train Latent Sequence Decompositions (LSD) and Maximum Extension (MaxExt) models with $n \in \{2,3,4,5\}$ sized word piece vocabularies and measure the distribution of the characters covered by the word pieces. The bars with the solid fill represent the LSD models, and the bars with the star hatch fill represent the MaxExt models. Both the LSD and MaxExt models prefer to use $n \geq 2$ sized word pieces to cover the majority of the characters. The MaxExt models prefer longer word pieces to cover characters compared to the LSD models.