Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition

Hainan Xu, Zhehuai Chen, Fei Jia, Boris Ginsburg

TL;DR

The paper addresses the suboptimal treatment of pronunciation information in end-to-end Transducer-based ASR by introducing Transducers with Pronunciation-aware Embeddings (PET). PET encodes pronunciation knowledge into the decoder embeddings via a multi-feature embedding composition, so that tokens with the same or similar pronunciations share parameters: $E_{\text{FINAL}, F}(v) = \sum_{f \in F} E_f(f(v))$, where each feature function $f$ in the set $F$ maps a token $v$ to a feature value with its own embedding table $E_f$. Across Mandarin Chinese and Korean datasets, PET yields consistent CER improvements and mitigates error chain reactions, reducing the likelihood that an early error triggers subsequent errors. The authors provide practical guidance on PET configuration and plan to release the implementation in the NeMo toolkit, enabling broader adoption and further exploration of pronunciation-aware modeling in ASR.
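The embedding composition in the TL;DR can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration: the toy `PINYIN` lexicon, the three feature functions (token identity, toned pinyin, toneless pinyin), the embedding dimension, and the random initialization are hypothetical and are not taken from the paper or from NeMo. The only thing the sketch takes from the source is the rule that a token's final embedding is the sum of one embedding per feature.

```python
import random


class FeatureEmbedding:
    """A lookup table holding one vector per feature value (hypothetical helper)."""

    def __init__(self, dim, rng):
        self.dim = dim
        self.rng = rng
        self.table = {}

    def lookup(self, key):
        # Create the vector lazily on first use, then always return the same one.
        if key not in self.table:
            self.table[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)]
        return self.table[key]


def pet_embedding(token, features, tables):
    """Compute the sum over f in F of E_f(f(token))."""
    dim = next(iter(tables.values())).dim
    total = [0.0] * dim
    for name, feature_fn in features.items():
        vec = tables[name].lookup(feature_fn(token))
        total = [a + b for a, b in zip(total, vec)]
    return total


# Assumed toy lexicon: character -> toned pinyin (illustration only).
PINYIN = {"他": "ta1", "她": "ta1", "它": "ta1", "马": "ma3", "妈": "ma1"}

# Assumed feature set F; the paper's actual features may differ.
FEATURES = {
    "token": lambda v: v,                 # unique per character
    "toned": lambda v: PINYIN[v],         # shared by exact homophones
    "toneless": lambda v: PINYIN[v][:-1], # shared by tone-only variants
}

rng = random.Random(0)
TABLES = {name: FeatureEmbedding(4, rng) for name in FEATURES}

e_he = pet_embedding("他", FEATURES, TABLES)
e_she = pet_embedding("她", FEATURES, TABLES)
e_horse = pet_embedding("马", FEATURES, TABLES)
e_mom = pet_embedding("妈", FEATURES, TABLES)
```

In this sketch, 他 and 她 differ only in their token-identity vector, while 马 and 妈 share just the toneless component, so a gradient update to a shared vector would affect every token that maps to it.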

Abstract

This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.
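One way to make the "error chain reaction" idea concrete is to compare how often a token is wrong right after a wrong token versus right after a correct one. The sketch below is an assumption about how such a statistic could be computed; this excerpt does not give the paper's exact definition, and the function name `conditional_error_rates` and the toy error flags are hypothetical.

```python
def conditional_error_rates(error_flags):
    """Estimate P(error | previous token wrong) and P(error | previous token right).

    error_flags: list of utterances, each a list of bools (True = token is an error).
    This is an illustrative statistic, not necessarily the paper's metric.
    """
    after_err = [0, 0]  # [errors observed, positions observed]
    after_ok = [0, 0]
    for flags in error_flags:
        # Walk over adjacent (previous, current) token pairs in one utterance.
        for prev, cur in zip(flags, flags[1:]):
            bucket = after_err if prev else after_ok
            bucket[0] += int(cur)
            bucket[1] += 1

    def rate(bucket):
        return bucket[0] / bucket[1] if bucket[1] else float("nan")

    return rate(after_err), rate(after_ok)


# Made-up per-token error flags for three utterances (illustration only).
toy_flags = [
    [False, False, True, True, True, False],
    [False, False, False, False],
    [True, True, False, False, False],
]

p_after_err, p_after_ok = conditional_error_rates(toy_flags)
print(f"P(error | prev error)   = {p_after_err:.3f}")
print(f"P(error | prev correct) = {p_after_ok:.3f}")
```

If errors were spread evenly, the two rates would be close; a much higher rate after an error is the clustering pattern the abstract describes.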

Paper Structure (13 sections, 4 equations, 2 figures, 7 tables)

Figures (2)

  • Figure 1: Transducer model architecture.
  • Figure 2: A chart showing homophone distributions in Mandarin Chinese, where a bar $(x, y)$ means that $y$ distinct pronunciations are each shared by exactly $x$ characters. For example, just under 300 characters have unique pronunciations, and slightly more than 200 pronunciations are each shared by two Chinese characters. On the right-hand side of the chart, certain pronunciations are shared by as many as 43 characters.
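The bar semantics of Figure 2 can be reproduced on a toy lexicon. This is a minimal sketch under stated assumptions: the `TOY_LEXICON` mapping is invented for illustration, each character is given a single pronunciation (characters with multiple readings are ignored), and `homophone_histogram` is a hypothetical helper name.

```python
from collections import Counter


def homophone_histogram(char_to_pron):
    """Return {x: y}: y pronunciations are each shared by exactly x characters."""
    # Step 1: count how many characters map to each pronunciation.
    chars_per_pron = Counter(char_to_pron.values())
    # Step 2: count how many pronunciations have each group size.
    return Counter(chars_per_pron.values())


# Assumed toy lexicon: character -> toned pinyin (illustration only).
TOY_LEXICON = {
    "他": "ta1", "她": "ta1", "它": "ta1",
    "马": "ma3", "码": "ma3",
    "妈": "ma1",
    "水": "shui3",
}

hist = homophone_histogram(TOY_LEXICON)
for x in sorted(hist):
    print(f"{hist[x]} pronunciation(s) shared by exactly {x} character(s)")
```

Here the bar at $x = 1$ has height 2 (ma1 and shui3), the bar at $x = 2$ has height 1 (ma3), and the bar at $x = 3$ has height 1 (ta1).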