Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition
Hainan Xu, Zhehuai Chen, Fei Jia, Boris Ginsburg
TL;DR
The paper addresses the suboptimal treatment of pronunciation information in end-to-end Transducer-based ASR by introducing Transducers with Pronunciation-aware Embeddings (PET). PET encodes pronunciation knowledge into the decoder embeddings by composing several feature embeddings, so that tokens with the same or similar pronunciations share parameters: $E_{\text{FINAL}, F}(v) = \sum_{f \in F} E_f(f(v))$, where $F$ is a set of feature functions applied to token $v$ and $E_f$ is the embedding table for feature $f$. On Mandarin Chinese and Korean datasets, PET yields consistent CER improvements over conventional Transducers. The analysis also shows that PET mitigates error chain reactions, in which an early error makes subsequent errors more likely. The authors provide practical guidance on PET configuration and plan to release the implementation in the NeMo toolkit.
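To make the embedding composition concrete, the following is a minimal sketch of summing per-feature embeddings for a token. It is not the authors' NeMo implementation: the toy vocabulary, the pronunciation labels, the two feature functions, and every name in the code (`PRONUNCIATION`, `FEATURES`, `TABLES`, `final_embedding`) are assumptions chosen for illustration, and the embeddings are random rather than trained.

```python
import random

# Hypothetical toy vocabulary: token id -> pronunciation label.
# Tokens 0 and 1 are treated as homophones purely for illustration.
PRONUNCIATION = {0: "ma", 1: "ma", 2: "shi"}


def identity_feature(v):
    """Feature that keeps the token identity, so each token has its own row."""
    return v


def pronunciation_feature(v):
    """Feature that maps a token to its (assumed) pronunciation label."""
    return PRONUNCIATION[v]


# The feature set F: each entry maps a token id v to a feature value f(v).
FEATURES = {"token": identity_feature, "pron": pronunciation_feature}

DIM = 4
rng = random.Random(0)


def new_table(keys):
    """Create a randomly initialised embedding table with one row per key."""
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(DIM)] for k in keys}


# One embedding table E_f per feature f.
TABLES = {
    "token": new_table(sorted(PRONUNCIATION)),
    "pron": new_table(sorted(set(PRONUNCIATION.values()))),
}


def final_embedding(v):
    """Return the sum over f in F of E_f(f(v)) for token id v."""
    out = [0.0] * DIM
    for name, f in FEATURES.items():
        row = TABLES[name][f(v)]
        out = [a + b for a, b in zip(out, row)]
    return out
```

In this sketch, tokens 0 and 1 receive different final embeddings because their identity rows differ, but both include the same `"ma"` row, so a gradient update to that row would affect both tokens.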
Abstract
This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers, where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. In experiments on multiple Mandarin Chinese and Korean datasets, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions: instead of being evenly spread throughout an utterance, recognition errors tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.
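One way to quantify the clustering of errors described above is to compare how often a token is wrong when the previous token was wrong against how often it is wrong when the previous token was correct. The sketch below assumes per-token error flags are already available from an alignment of hypothesis and reference. The function name `conditional_error_rates` and the input layout are hypothetical, and this is not necessarily the statistic used in the paper.

```python
def conditional_error_rates(utterances):
    """Estimate P(error | previous error) and P(error | previous correct).

    utterances: list of lists of 0/1 flags, one flag per token,
                where 1 marks a token that was recognised incorrectly.
    Returns a tuple (rate_after_error, rate_after_correct).
    """
    after_error = [0, 0]    # [error count, total count] following an error
    after_correct = [0, 0]  # [error count, total count] following a correct token

    for flags in utterances:
        # Walk over adjacent (previous, current) token pairs within an utterance.
        for prev, cur in zip(flags, flags[1:]):
            bucket = after_error if prev else after_correct
            bucket[0] += cur
            bucket[1] += 1

    def rate(bucket):
        # Return NaN when no pairs of this kind were observed.
        return bucket[0] / bucket[1] if bucket[1] else float("nan")

    return rate(after_error), rate(after_correct)


# Made-up flags for illustration only; these are not results from the paper.
toy_flags = [[0, 0, 1, 1, 1, 0], [0, 0, 0, 0]]
p_after_error, p_after_correct = conditional_error_rates(toy_flags)
```

If errors were spread evenly, the two rates would be close to each other. A much higher rate after an error than after a correct token is consistent with the chain-reaction behaviour the abstract describes.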
