Latent Sequence Decompositions
William Chan, Yu Zhang, Quoc Le, Navdeep Jaitly
TL;DR
The paper introduces Latent Sequence Decompositions (LSD), a framework that learns a distribution over variable-length output tokenizations (word pieces) conditioned on both the input and the target sequence, addressing the limitations of fixed decompositions in end-to-end seq2seq models. It defines a probabilistic model over latent decompositions z, with a collapsing function that recovers y from z, and trains by sampling valid extensions with an ε-greedy strategy coupled with an approximate gradient estimator. Decoding uses left-to-right beam search to find the best latent path and collapses it to the final output, avoiding explicit marginalization over all decompositions. On WSJ ASR, LSD reduces WER from 14.8% for a character baseline to 12.9%, and to 9.6% when combined with a CNN encoder, showing that input-dependent, variable-length tokenizations can be learned without external language models.
Abstract
We present the Latent Sequence Decompositions (LSD) framework. LSD decomposes sequences into variable-length output units as a function of both the input sequence and the output sequence. We present a training algorithm that samples valid extensions and an approximate decoding algorithm. We experiment with the Wall Street Journal speech recognition task. Our LSD model achieves 12.9% WER compared to a character baseline of 14.8% WER. When combined with a convolutional network on the encoder, we achieve 9.6% WER.
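
To make the idea of sampling valid extensions concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes that the collapsing function is plain string concatenation and that a valid extension is any vocabulary token matching the target at the current position. The names `collapse`, `valid_extensions`, `sample_decomposition`, and `score_fn` are hypothetical; `score_fn` stands in for the model's probability of the next token given the decomposition so far, and the toy vocabulary is invented for illustration.

```python
import random


def collapse(z):
    """Collapse a decomposition z (list of tokens) back into the output string y.
    Assumption: collapsing is plain concatenation."""
    return "".join(z)


def valid_extensions(y, pos, vocab):
    """Return the non-empty vocabulary tokens that match y starting at index pos."""
    return [tok for tok in vocab if tok and y.startswith(tok, pos)]


def sample_decomposition(y, vocab, score_fn, epsilon=0.1, rng=None):
    """Sample one decomposition of y, left to right.

    With probability epsilon the next token is drawn uniformly from the valid
    extensions; otherwise it is drawn in proportion to score_fn(z, tok), a
    hypothetical stand-in for the model's next-token probability.
    """
    rng = rng or random.Random()
    z, pos = [], 0
    while pos < len(y):
        candidates = valid_extensions(y, pos, vocab)
        if not candidates:
            raise ValueError(f"no vocabulary token matches y at position {pos}")
        if rng.random() < epsilon:
            # Exploration step: uniform over valid extensions.
            tok = rng.choice(candidates)
        else:
            # Model step: sample among valid extensions only.
            weights = [score_fn(z, tok) for tok in candidates]
            tok = rng.choices(candidates, weights=weights, k=1)[0]
        z.append(tok)
        pos += len(tok)
    return z


if __name__ == "__main__":
    toy_vocab = ["c", "a", "t", "ca", "at", "cat"]  # invented for illustration
    uniform = lambda prefix, tok: 1.0  # placeholder for a trained model
    z = sample_decomposition("cat", toy_vocab, uniform, rng=random.Random(0))
    print(z, "->", collapse(z))
```

Because every sampled token must match the target, any decomposition the sketch produces collapses back to y exactly. In the toy vocabulary, "cat" could be sampled as `["c", "a", "t"]`, `["ca", "t"]`, `["c", "at"]`, or `["cat"]`.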
