Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Muhammad Shakeel; Yui Sudo; Yifan Peng; Shinji Watanabe

TL;DR

This work presents joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation between the two modular encoders and decoders.

Abstract

End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes, streaming and non-streaming, each with its pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, 2) separate decoders to make mode switching flexible, and 3) similarity-preserving knowledge distillation between the two modular encoders and decoders to enhance performance. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model compared to multiple standalone modules.
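To make the distillation idea concrete, the following is a minimal sketch of a similarity-preserving loss next to a plain mean-square-error loss. It assumes the commonly used formulation in which each module's features are turned into a row-normalized pairwise similarity matrix and the student matrix is pulled towards the teacher matrix. The abstract does not give the exact formulation, so the function names, the flattening of features into vectors, and the normalization are illustrative assumptions and not the authors' implementation.

```python
import math


def pairwise_similarity(features):
    """Row-normalized Gram matrix of a list of flattened feature vectors (assumed form)."""
    gram = [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]
    normalized = []
    for row in gram:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm if norm > 0 else 0.0 for x in row])
    return normalized


def sp_distillation_loss(student_feats, teacher_feats):
    """Mean squared difference between student and teacher similarity matrices."""
    g_s = pairwise_similarity(student_feats)
    g_t = pairwise_similarity(teacher_feats)
    b = len(student_feats)
    total = sum((g_s[i][j] - g_t[i][j]) ** 2 for i in range(b) for j in range(b))
    return total / (b * b)


def mse_distillation_loss(student_feats, teacher_feats):
    """Element-wise mean squared error; requires identical feature dimensions."""
    diffs = [(s - t) ** 2
             for u, v in zip(student_feats, teacher_feats)
             for s, t in zip(u, v)]
    return sum(diffs) / len(diffs)


# Hypothetical toy features: the teacher is a scaled copy of the student.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
sp_loss = sp_distillation_loss(student, teacher)
mse_loss = mse_distillation_loss(student, teacher)
```

In this toy case the similarity-preserving loss is essentially zero because the two feature sets share the same relational structure, while the element-wise loss still penalizes the difference in scale. The similarity matrices also have the same shape regardless of feature dimension, so under this formulation the student and teacher features need not match in size.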

Paper Structure (5 sections, 13 equations, 2 figures, 3 tables)


Introduction
Joint optimization of ASR model
Experiments
Main results
Conclusion

Figures (2)

Figure 1: Joint optimization of multi-decoder ASR model: A single model with streaming (student) and non-streaming (teacher) modules, both of which are jointly optimized.
Figure 2: Comparison of two knowledge distillation methods: mean square error-based encoder-side distillation (mse-ED) and similarity-preserving encoder-side distillation (sp-ED) (ours). Results are presented on two evaluation sets, test-clean and test-other, for varying block sizes, with models trained on the LibriSpeech 100-hour dataset.
