Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition

Hainan Xu; Zhehuai Chen; Fei Jia; Boris Ginsburg

TL;DR

The paper addresses the suboptimal treatment of pronunciation information in end-to-end Transducer-based ASR by introducing Transducers with Pronunciation-aware Embeddings (PET). PET encodes pronunciation knowledge into the decoder embeddings by composing multiple feature embeddings, so that tokens with the same or similar pronunciations share parameters: $E_{\text{FINAL}, F}(v) = \sum_{f \in F} E_f(f(v))$, where each feature function $f \in F$ maps a token $v$ to an index into the embedding table $E_f$. Across Mandarin Chinese and Korean datasets, PET yields consistent CER improvements. The analysis also shows that PET mitigates error chain reactions, reducing the likelihood that an early error triggers subsequent errors. The authors provide practical guidance on PET configuration and plan to release the implementation in the NeMo toolkit, enabling broader adoption and further exploration of pronunciation-aware modeling in ASR.
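The composition $E_{\text{FINAL}, F}(v) = \sum_{f \in F} E_f(f(v))$ can be illustrated with a minimal pure-Python sketch. Everything named here is an assumption for illustration only: the toy vocabulary, the choice of two features (`identity` and `pronunciation`), the embedding size, and the helper names `make_table` and `pet_embedding` are not taken from the paper or from NeMo.

```python
import random

def make_table(num_rows, dim, rng):
    """Create a randomly initialised embedding table (a list of vectors)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def pet_embedding(token, features, tables):
    """Sum the rows selected by each feature function: sum over f of E_f(f(v))."""
    dim = len(next(iter(tables.values()))[0])
    out = [0.0] * dim
    for name, fn in features.items():
        row = tables[name][fn(token)]
        out = [a + b for a, b in zip(out, row)]
    return out

# Hypothetical toy vocabulary: token id -> (character, pronunciation id).
# Tokens 0 and 1 are homophones here, so they share pronunciation id 0.
VOCAB = {0: ("她", 0), 1: ("他", 0), 2: ("好", 1)}

# Assumed feature set F: one token-specific feature, one pronunciation feature.
features = {
    "identity": lambda v: v,                 # unique row per token
    "pronunciation": lambda v: VOCAB[v][1],  # row shared by homophones
}

rng = random.Random(0)
tables = {
    "identity": make_table(len(VOCAB), 4, rng),
    "pronunciation": make_table(2, 4, rng),
}

e0 = pet_embedding(0, features, tables)
e1 = pet_embedding(1, features, tables)
```

In this sketch, `e0` and `e1` differ only through their `identity` rows; the `pronunciation` row is the same for both, which is the parameter sharing the formula describes.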

Abstract

This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.
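One simple way to check whether errors cluster is to compare how often an error follows another error with how often it follows a correct token. The sketch below is an assumed measurement, not necessarily the statistic used by the authors; the function name `conditional_error_rates` and the per-token 0/1 error flags are hypothetical.

```python
def conditional_error_rates(flags):
    """flags: list of 0/1 per token position, where 1 marks a recognition error.

    Returns (P(error | previous error), P(error | previous correct)).
    """
    after_err = after_ok = err_after_err = err_after_ok = 0
    for prev, cur in zip(flags, flags[1:]):
        if prev:
            after_err += 1
            err_after_err += cur
        else:
            after_ok += 1
            err_after_ok += cur
    p_err = err_after_err / after_err if after_err else 0.0
    p_ok = err_after_ok / after_ok if after_ok else 0.0
    return p_err, p_ok

# Made-up error flags for one utterance (illustrative only).
flags = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]
p_after_error, p_after_correct = conditional_error_rates(flags)
```

If errors were spread evenly, the two rates would be close to each other; a first rate that is clearly higher than the second would be consistent with the grouping behaviour described above.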

Paper Structure (13 sections, 4 equations, 2 figures, 7 tables)

Introduction
Background and Motivation for PET
Pronunciation-aware Embeddings
Experiments
Mandarin Chinese experiments
Korean experiments
Discussions
Pronunciation Information in Joiner Embeddings
Error Chain Reactions
PET Models Suppress Error Chain Reactions
Recommended PET Configs and Impact on Model Size
Conclusion
Acknowledgments

Figures (2)

Figure 1: Transducer model architecture.
Figure 2: A chart showing homophone distributions in Mandarin Chinese, where a bar $(x, y)$ means that $y$ distinct pronunciations are each shared by exactly $x$ characters. For example, just under 300 characters have unique pronunciations, and slightly more than 200 pronunciations are each shared by two characters. On the right-hand side of the chart, certain pronunciations are shared by as many as 43 characters.
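The bars in Figure 2 can be thought of as a histogram over pronunciation group sizes. A minimal sketch of that counting, assuming a hypothetical toy lexicon that maps each character to a single pronunciation (real lexicons may list several pronunciations per character):

```python
from collections import Counter

# Hypothetical toy lexicon: character -> pronunciation (pinyin with tone number).
lexicon = {
    "他": "ta1", "她": "ta1", "它": "ta1",
    "好": "hao3",
    "是": "shi4", "事": "shi4",
}

# pronunciation -> number of characters that share it
chars_per_pronunciation = Counter(lexicon.values())

# x -> y, where y pronunciations are each shared by exactly x characters
histogram = Counter(chars_per_pronunciation.values())
```

Each `(x, y)` pair in `histogram` corresponds to one bar in the chart.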
