CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR

Wenbo Zhao, Ziwei Li, Chuan Yu, Zhijian Ou

TL;DR

CUSIDE-T adapts the CUSIDE method, originally built on the CTC architecture, to the recurrent neural network transducer (RNN-T) ASR architecture, and achieves higher streaming ASR accuracy than U2++ at equal latency settings.

Abstract

Streaming automatic speech recognition (ASR) is important for many real-world ASR applications. A notable challenge for streaming ASR systems, however, lies in balancing recognition accuracy against latency constraints. Recently, a method of chunking, simulating future context and decoding, called CUSIDE, was proposed for connectionist temporal classification (CTC) based streaming ASR; it obtains a good balance between reduced latency and high recognition accuracy. In this paper, we present CUSIDE-T, which adapts the CUSIDE method to the recurrent neural network transducer (RNN-T) ASR architecture instead of the CTC architecture. We also incorporate language model rescoring in CUSIDE-T to further enhance accuracy while adding only a small amount of latency. Extensive experiments are conducted on the AISHELL-1, WenetSpeech and SpeechIO datasets, comparing CUSIDE-T with U2++ (both based on RNN-T), an existing chunk-based streaming ASR method. The results show that CUSIDE-T achieves higher accuracy than U2++ for streaming ASR at equal latency settings.

Paper Structure (16 sections, 5 equations, 2 figures, 2 tables)

This paper contains 16 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of CUSIDE-T. The encoder, predictor and joiner are the network components of RNN-T. Input spectrum features are first split into overlapping chunks, and each chunk is concatenated with a simulated right context predicted by the simulation network (SimuNet). The target label sequence $\mathbf{y}$ is fed into the predictor. The encoder's hidden states, with those from the left and right contexts discarded, are combined with the predictor output and fed to the joiner, which calculates the conditional probability of the labels.
  • Figure 2: CER comparison of CUSIDE-T and U2++ on the SpeechIO-{01-18} benchmark data under different decoding methods. The chunk size is 400 ms.
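
The chunking flow described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the chunk and context sizes, and the frame representation are all hypothetical. In particular, `simulate_right` is only a stand-in that repeats the last frame, whereas SimuNet in the paper is a trained network, and the encoder is assumed to emit one state per input frame (no subsampling) so that context states can be trimmed by index.

```python
from typing import Callable, List, Tuple

Frame = List[float]


def split_into_chunks(frames: List[Frame], chunk: int,
                      left: int) -> List[Tuple[List[Frame], List[Frame]]]:
    """Return (left_context, chunk_body) pairs; consecutive inputs overlap
    because each chunk re-reads up to `left` frames of history."""
    pairs = []
    for start in range(0, len(frames), chunk):
        lo = max(0, start - left)
        pairs.append((frames[lo:start], frames[start:start + chunk]))
    return pairs


def simulate_right(body: List[Frame], right: int) -> List[Frame]:
    """Hypothetical stand-in for SimuNet: repeat the last seen frame."""
    return [list(body[-1]) for _ in range(right)]


def encode_chunk(left_ctx: List[Frame], body: List[Frame],
                 right_ctx: List[Frame],
                 encoder: Callable[[List[Frame]], List[Frame]]) -> List[Frame]:
    """Encode [left | body | simulated right], then keep only the states
    that correspond to the chunk body (context states are discarded)."""
    states = encoder(left_ctx + body + right_ctx)
    return states[len(left_ctx):len(left_ctx) + len(body)]


# Usage with an identity "encoder" (assumption: one state per frame).
frames = [[float(i)] for i in range(10)]
chunks = split_into_chunks(frames, chunk=4, left=2)
kept: List[Frame] = []
for left_ctx, body in chunks:
    right_ctx = simulate_right(body, right=2)
    kept.extend(encode_chunk(left_ctx, body, right_ctx, encoder=lambda x: x))
```

With the identity encoder, the retained states cover every input frame exactly once, which is the property the trimming step is meant to guarantee before the states reach the joiner. A real encoder with subsampling would need the trim indices scaled accordingly.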