Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition

Hyeonseung Lee, Ji Won Yoon, Sungsoo Kim, Nam Soo Kim

TL;DR

This work introduces a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, namely forward variable causal compensation (FoCC), and presents its estimator, FoCCE, as a solution to estimate the exact likelihood.

Abstract

Transducer neural networks have emerged as the mainstream approach for streaming automatic speech recognition (ASR), offering state-of-the-art performance in balancing accuracy and latency. In the conventional framework, streaming transducer models are trained to maximize the likelihood function based on non-streaming recursion rules. However, this approach leads to a mismatch between training and inference, resulting in the issue of deformed likelihood and consequently suboptimal ASR accuracy. We introduce a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, namely forward variable causal compensation (FoCC). We also present its estimator, FoCCE, as a solution to estimate the exact likelihood. Through experiments on the LibriSpeech dataset, we show that FoCCE training improves the accuracy of the streaming transducers.

Paper Structure

This paper contains 13 sections, 20 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: An illustration of the proposed FoCCE training. The streaming transducer network (left) and the FoCCE network (middle) respectively estimate the local probabilities and FoCC values, which are used to estimate the actual likelihood by the modified forward variable recursion rule (red boxes on the right).
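
For orientation, the conventional (non-streaming) forward variable recursion that the abstract refers to can be sketched as below. This is the standard transducer recursion from the general literature, not the modified FoCC rule shown in Figure 1, whose exact form is not given in this summary. The function name `transducer_log_likelihood` and the inputs `log_blank` and `log_label` are hypothetical names chosen for illustration; they stand for the per-node log probabilities of emitting blank and of emitting the next target label.

```python
import math


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(log_blank, log_label):
    """Standard transducer forward recursion in the log domain.

    log_blank[t][u]: log prob of blank at lattice node (t, u), shape T x (U+1)
    log_label[t][u]: log prob of the (u+1)-th target label at node (t, u), shape T x U
    (Both are assumed inputs; a real model would compute them from its joint network.)
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0  # log(1): every alignment starts at node (0, 0)
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at (t-1, u)
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:  # arrive by emitting a label at (t, u-1)
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t][u - 1] + log_label[t][u - 1])
    # Every alignment ends with a final blank emitted at the last node.
    return alpha[T - 1][U] + log_blank[T - 1][U]


# Toy lattice with T=2 frames and U=1 label. Two alignments exist:
# label-blank-blank (0.4 * 0.9 * 0.8) and blank-label-blank (0.6 * 0.3 * 0.8).
toy_blank = [[math.log(0.6), math.log(0.9)], [math.log(0.7), math.log(0.8)]]
toy_label = [[math.log(0.4)], [math.log(0.3)]]
toy_ll = transducer_log_likelihood(toy_blank, toy_label)
```

On the toy lattice the two alignment probabilities sum to 0.288 + 0.144 = 0.432, so `math.exp(toy_ll)` should agree with that value up to floating-point error. In a streaming model the local probabilities at frame t depend only on limited future context, which is the setting in which the abstract argues this recursion yields a deformed likelihood.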