Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

Hagen Soltau, Hank Liao, Hasim Sak

TL;DR

This work tackles large-vocabulary end-to-end speech recognition by modeling words directly with a bidirectional LSTM and a CTC loss, trained on 125,000 hours of semi-supervised YouTube captions. The Neural Speech Recognizer (NSR) eliminates the need for pronunciation lexica and decoding, outputting word posteriors directly and supporting both spoken and written vocabularies. NSR achieves competitive WERs, notably 11.6% in the spoken domain and 13.4% in the written domain (both with an LM), outperforming a strong conventional CD-phone baseline under the same conditions. The results demonstrate that large-scale word-level CTC models can deliver robust end-to-end ASR performance in real-world, diverse domains while simplifying the overall system architecture.

Abstract

We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, which removes the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
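
To make "removing the need to decode" concrete, the sketch below shows how a word sequence could be read directly off per-frame word posteriors. It assumes the standard CTC best-path rule (take the top label in each frame, merge consecutive repeats, drop blanks); the abstract does not spell out this procedure. The function name, the `<blank>` label, the toy vocabulary, and the posterior values are all hypothetical and are not taken from the authors' system.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical label for the CTC blank symbol


def greedy_ctc_words(posteriors, vocab):
    """Read a word sequence off per-frame posteriors without a search.

    posteriors: list of frames, each a list of probabilities over vocab.
    vocab: list of labels, including BLANK.
    """
    # 1. Pick the most probable label in every frame.
    best = [vocab[max(range(len(frame)), key=frame.__getitem__)]
            for frame in posteriors]
    # 2. Merge consecutive repeats of the same label.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Drop blanks; what remains is the word sequence.
    return [word for word in collapsed if word != BLANK]


# Toy example: 4 labels, 7 frames (made-up numbers, not model output).
vocab = [BLANK, "take", "me", "back"]
posteriors = [
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.7, 0.1, 0.2, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.6, 0.0, 0.1, 0.3],
    [0.1, 0.0, 0.0, 0.9],
]
words = greedy_ctc_words(posteriors, vocab)
print(" ".join(words))  # prints "take me back"
```

Because the output units are whole words, this per-frame argmax already yields a transcript; no lexicon lookup or graph search is involved. A blank between two identical labels keeps them as two separate words, which is how a genuinely repeated word survives the merge step.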

Paper Structure

This paper contains 6 sections, 2 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The word posterior probabilities as predicted by the NSR model at each time-frame (30 msec) for a segment of the music video 'Stressed Out' by Twenty One Pilots. We only plot the word with highest posterior and the missing words from the correct transcription: 'Sometimes a certain smell will take me back to when I was young, how come I'm never able to identify where it's coming from'.