Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition

Hyeonseung Lee; Ji Won Yoon; Sungsoo Kim; Nam Soo Kim

TL;DR

This work introduces a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, namely forward variable causal compensation (FoCC), and presents its estimator, FoCCE, as a solution to estimate the exact likelihood.

Abstract

Transducer neural networks have emerged as the mainstream approach for streaming automatic speech recognition (ASR), offering state-of-the-art performance in balancing accuracy and latency. In the conventional framework, streaming transducer models are trained to maximize the likelihood function based on non-streaming recursion rules. However, this approach leads to a mismatch between training and inference, resulting in the issue of deformed likelihood and consequently suboptimal ASR accuracy. We introduce a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, namely forward variable causal compensation (FoCC). We also present its estimator, FoCCE, as a solution to estimate the exact likelihood. Through experiments on the LibriSpeech dataset, we show that FoCCE training improves the accuracy of the streaming transducers.
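The abstract refers to a likelihood computed with non-streaming recursion rules but does not write the recursion out. As a point of reference, the sketch below assumes the standard transducer forward-variable recursion over a (frame, label) lattice, evaluated in the log domain. It does not implement FoCC or FoCCE, and the function names and input layout (`log_blank`, `log_label`) are hypothetical choices made for illustration, not the authors' code.

```python
import math

NEG_INF = -math.inf


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b)) for two log-probabilities."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(log_blank, log_label):
    """Log-likelihood of one target sequence under the standard transducer lattice.

    Assumed (hypothetical) input layout:
      log_blank[t][u]: log-prob of blank at frame t after u labels, size T x (U+1)
      log_label[t][u]: log-prob of the (u+1)-th target label at frame t
                       after u labels, size T x U
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1

    # alpha[t][u]: log-prob of having emitted the first u labels
    # on reaching frame t.
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0

    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive from the previous frame by emitting blank.
            via_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            # Arrive from the previous label position by emitting a label.
            via_label = alpha[t][u - 1] + log_label[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = logsumexp2(via_blank, via_label)

    # A final blank at the last lattice node terminates the alignment.
    return alpha[T - 1][U] + log_blank[T - 1][U]
```

With two frames, one target label, and every blank and label probability set to 0.5, the lattice has two valid alignments of probability 0.125 each, so the function returns approximately log(0.25). According to the abstract, applying a non-streaming recursion of this kind to a streaming model is what produces the deformed likelihood that FoCC quantifies.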
