CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR

Wenbo Zhao; Ziwei Li; Chuan Yu; Zhijian Ou

TL;DR

CUSIDE-T adapts the CUSIDE method from the CTC architecture to the recurrent neural network transducer (RNN-T) ASR architecture and, at equal latency settings, achieves higher streaming ASR accuracy than U2++.

Abstract

Streaming automatic speech recognition (ASR) is essential for many real-world ASR applications. A notable challenge for streaming ASR systems, however, lies in balancing performance against latency constraints. Recently, a method of chunking, simulating future context and decoding, called CUSIDE, has been proposed for connectionist temporal classification (CTC) based streaming ASR; it achieves a good balance between low latency and high recognition accuracy. In this paper, we present CUSIDE-T, which adapts the CUSIDE method to the recurrent neural network transducer (RNN-T) ASR architecture instead of the CTC architecture. We also incorporate language model rescoring in CUSIDE-T to further improve accuracy, at the cost of only a small additional latency. Extensive experiments are conducted on the AISHELL-1, WenetSpeech and SpeechIO datasets, comparing CUSIDE-T with U2++, an existing chunk-based streaming ASR method (both based on RNN-T). The results show that CUSIDE-T achieves higher accuracy for streaming ASR at equal latency settings.

Paper Structure (16 sections, 5 equations, 2 figures, 2 tables)

This paper contains 16 sections, 5 equations, 2 figures, 2 tables.

Introduction
Background and Related Work
RNN-T model
Chunk-based training
Unified streaming and non-streaming recognition
Decoding in streaming ASR
CUSIDE-T
Chunking and context simulation
Multi-objective training
External language model fusion
Experiments
Datasets
Experimental Setup
AISHELL-1 Task
WenetSpeech Task
...and 1 more sections

Figures (2)

Figure 1: Overview of CUSIDE-T. The encoder, predictor and joiner are the network components of RNN-T. Input spectrum features are first split into overlapped chunks, and each chunk is then concatenated with a simulated right context predicted by the simulation network (SimuNet). The target label sequence $\mathbf{y}$ is fed into the predictor. The hidden encoded states, with those from the left and right contexts removed, are spliced with the output of the predictor and fed to the joiner to calculate the conditional probability of labels.
Figure 2: The CER comparison of CUSIDE-T and U2++ on the SpeechIO-{01-18} benchmarking data with different decoding manners. The chunk size is 400ms.
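
The chunking flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the frame-level chunk sizes, the identity encoder, and the "repeat the last frame" stand-in for SimuNet are all assumptions made for the example, and encoder subsampling is ignored.

```python
from typing import Callable, List, Sequence

Frame = List[float]


def simulate_right_context(chunk: Sequence[Frame], n_future: int) -> List[Frame]:
    """Hypothetical stand-in for SimuNet: repeat the last frame of the chunk.

    The real SimuNet is a trained network; this placeholder only keeps lengths right.
    """
    return [list(chunk[-1]) for _ in range(n_future)]


def chunk_with_context(
    frames: Sequence[Frame],
    chunk_size: int,
    n_left: int,
    n_future: int,
    simulate: Callable[[Sequence[Frame], int], List[Frame]] = simulate_right_context,
):
    """Yield (model_input, left_len, right_len) for each chunk of the utterance."""
    for start in range(0, len(frames), chunk_size):
        chunk = list(frames[start:start + chunk_size])
        # Real past frames: this is where consecutive model inputs overlap.
        left = list(frames[max(0, start - n_left):start])
        # Simulated (not real) future frames, so no waiting for future audio.
        right = simulate(chunk, n_future)
        yield left + chunk + right, len(left), len(right)


def strip_context(encoded: Sequence[Frame], left_len: int, right_len: int) -> List[Frame]:
    """Keep only the encoded states that belong to the current chunk."""
    end = len(encoded) - right_len
    return list(encoded[left_len:end])


def encode_stream(frames, chunk_size, n_left, n_future, encoder=lambda x: x):
    """Run a placeholder (identity) encoder chunk by chunk and splice the kept states."""
    states = []
    for model_input, left_len, right_len in chunk_with_context(
        frames, chunk_size, n_left, n_future
    ):
        states.extend(strip_context(encoder(model_input), left_len, right_len))
    return states


if __name__ == "__main__":
    toy_frames = [[float(i)] for i in range(10)]  # 10 one-dimensional frames
    kept = encode_stream(toy_frames, chunk_size=4, n_left=2, n_future=2)
    print(len(kept))  # one kept state per input frame with the identity encoder
```

In a real system the spliced states would then be combined with the predictor output in the joiner; that step is omitted here because the caption does not give enough detail to sketch it faithfully.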
