Advanced Long-Content Speech Recognition With Factorized Neural Transducer

Xun Gong; Yu Wu; Jinyu Li; Shujie Liu; Rui Zhao; Xie Chen; Yanmin Qian

TL;DR

This paper proposes two approaches that integrate long-content information into the factorized neural transducer (FNT) architecture: LongFNT for non-streaming and SLongFNT for streaming speech recognition. The results highlight the value of long-content speech and transcription knowledge for improving both kinds of systems.

Abstract

In this paper, we propose two novel approaches that integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT) and streaming (referred to as SLongFNT) scenarios. We first investigate whether long-content transcriptions can improve vanilla conformer transducer (C-T) models. Our experiments indicate that vanilla C-T models do not improve when given long-content transcriptions, possibly because the predictor network of a C-T model does not function as a pure language model. FNT, in contrast, shows potential for exploiting long-content information. We therefore propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models complement each other to achieve better performance, with transcription history proving more valuable to the model. We evaluate LongFNT on the LibriSpeech and GigaSpeech corpora, where it obtains relative word error rate (WER) reductions of 19% and 12%, respectively. Furthermore, we extend LongFNT to the streaming scenario as SLongFNT, which consists of SLongFNT-Text and SLongFNT-Speech approaches for utilizing long-content text and speech information. Experiments show that, compared to the FNT baseline, SLongFNT achieves relative WER reductions of 26% and 17% on LibriSpeech and GigaSpeech, respectively, while maintaining good latency. Overall, LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.

Paper Structure (26 sections, 15 equations, 8 figures, 7 tables)

Introduction
Revisit on Neural Transducer
Transformer-based Neural Transducers
Factorized Neural Transducer
LongFNT: Long-content Factorized Neural Transducer ASR
LongFNT-Text: Long-content Text Integration of FNT
LongFNT-Speech: Long-content Enhanced Speech Encoder
Training strategies for LongFNT
SLongFNT: Speed up LongFNT ASR in streaming scenario
SLongFNT-Text
SLongFNT-Speech
Training strategies for SLongFNT
Experimental Setup
Experimental Results and Analysis
Evaluation on the non-streaming LongFNT model
...and 11 more sections

Figures (8)

Figure 1: Illustration of the factorized neural transducer (FNT) and its improved version [zhaorui].
Figure 2: Architecture of LongFNT-Text: the text-side context encoder and long-content textual integration methods for $\text{Pred}^V$
Figure 3: Architecture of LongFNT-Speech: the speech encoder pipeline with long-content speech input. Only the yellow part is used for gradient backpropagation.
Figure 4: Architecture of SLongFNT-Text: the long-content self-attention module and two different kinds of historical textual information.
Figure 5: Architecture of the attention layer in SLongFNT-Speech: the long-content chunk-based self-attention module in the speech encoder. Downsampling is applied to reduce the history length.
...and 3 more figures
