Streaming Speech-to-Confusion Network Speech Recognition

Denis Filimonov; Prabhat Pandey; Ariya Rastrow; Ankur Gandhe; Andreas Stolcke

TL;DR

A novel streaming ASR architecture is presented that outputs a confusion network while maintaining limited latency, as needed for interactive applications, and which outperforms a strong RNN-T baseline on a far-field voice assistant task.

Abstract

In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.

Paper Structure (6 sections, 6 equations, 3 figures, 4 tables)

Introduction
Related Work
Model Architecture
Rescoring Methods
Experiments
Conclusions

Figures (3)

Figure 1: Schematic diagram of the speech-to-confusion network (S2CN) model.
Figure 2: Segment start/end predictions relative to the closest word boundary. In the left plot, "Early 2" indicates the percentage of segments that start two frames earlier than the closest word start boundary.
Figure 3: Segment distribution and accuracy. The horizontal axis denotes the range of segment lengths in 30 ms frames.
