Right Label Context in End-to-End Training of Time-Synchronous ASR Models
Tina Raissi, Ralf Schlüter, Hermann Ney
TL;DR
This work investigates incorporating right label context into full-sum training of time-synchronous ASR models. It identifies the normalization problems that right-context conditioning causes in the gradient of discriminative criteria and addresses them with a factored, triphone-based approach: a factored hybrid HMM whose loss uses auxiliary left and right label factors and sums over all alignments. Right-context modeling yields gains, particularly when training data are scarce. The paper also shows that a factored hybrid HMM trained end to end with the full-sum criterion alone, without external alignments, can achieve competitive performance relative to multi-stage pipelines, validated on Switchboard 300h and LibriSpeech 960h. Overall, the findings highlight the practical value of right-context conditioning for simpler, more robust ASR systems and show that end-to-end training is feasible within the hybrid HMM framework.
Abstract
Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained using a sequence-level cross-entropy that sums over all alignments. Due to the discriminative formulation, incorporating the right label context into the training criterion's gradient causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM), with its inherent generative formulation, enables conditioning on the right label context. However, due to HMM state-tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that the inclusion of the right label context is particularly beneficial when training data resources are limited. Moreover, we show that it is possible to build a factored hybrid HMM system by relying exclusively on the full-sum criterion. Experiments were conducted on Switchboard 300h and LibriSpeech 960h.
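
To make the phrase "sums over all alignments" concrete, the following is a minimal sketch of a generic full-sum score computed with the forward recursion. It is not the factored loss proposed in the paper. It assumes a simple left-to-right topology in which every label occupies at least one frame, uses per-frame log scores only (no transition probabilities, no context factors), and requires at least one frame. The names `log_add` and `full_sum_log_score` are hypothetical and chosen for illustration.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def full_sum_log_score(log_scores, labels):
    """Log of the sum, over all monotonic alignments, of the product of frame scores.

    log_scores: list of T frames, each a list of per-label log scores (T >= 1).
    labels:     list of S label indices forming the target sequence.
    Assumption: each label covers one or more consecutive frames, in order.
    """
    # alpha[s] = log-sum over all partial alignments that end in label position s
    alpha = [NEG_INF] * len(labels)
    alpha[0] = log_scores[0][labels[0]]
    for frame in log_scores[1:]:
        new_alpha = []
        for s, label in enumerate(labels):
            stay = alpha[s]                                # remain in the same label
            advance = alpha[s - 1] if s > 0 else NEG_INF   # move on from the previous label
            new_alpha.append(log_add(stay, advance) + frame[label])
        alpha = new_alpha
    # Only alignments that have reached the last label are complete
    return alpha[-1]


# Usage: 4 frames, 2 labels, every frame score equal to 0.5
uniform = [[math.log(0.5)] * 2 for _ in range(4)]
print(math.exp(full_sum_log_score(uniform, [0, 1])))
```

In the usage example there are three ways to split four frames between two labels, each with probability 0.5^4, so the printed value is approximately 0.1875. The recursion costs O(T·S) operations instead of enumerating every alignment explicitly.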
