Streaming Speech-to-Confusion Network Speech Recognition

Denis Filimonov, Prabhat Pandey, Ariya Rastrow, Ankur Gandhe, Andreas Stolcke

TL;DR

A novel streaming ASR architecture is presented that outputs a confusion network while maintaining limited latency, as needed for interactive applications, and which outperforms a strong RNN-T baseline on a far-field voice assistant task.

Abstract

In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.

Paper Structure

This paper contains 6 sections, 6 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Schematic diagram of the speech-to-confusion network (S2CN) model.
  • Figure 2: Segment start/end predictions relative to the closest word boundary. In the left plot, "Early 2" indicates the percentage of segments that start two frames earlier than the closest word start boundary.
  • Figure 3: Segment distribution and accuracy. The horizontal axis denotes the range of segment lengths in 30 ms frames.