Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing

Gaofeng Cheng, Haitian Lu, Chengxu Yang, Xuyang Wang, Ta Li, Yonghong Yan

TL;DR

This paper introduces ATPC, a data-driven method to automatically derive text-pronunciation correlations using supervision from speech-text pairs, enabling lexicon-free pronunciation modeling. The pipeline combines ITSE-based text-speech alignment, pronunciation-embedding extraction from multilingual speech models, and DTW-based distance calculations to produce an ATPC matrix, demonstrated on Mandarin. Experiments show that ATPC improves contextual biasing in E2E-ASR, reducing CER and B-CER and boosting hotword recall and F1, while remaining applicable to dialects or languages without handcrafted lexicons. Limitations remain relative to manual lexicons, and future work includes multilingual extension, handling out-of-vocabulary items, larger data scales, and public resource development.

Abstract

Effectively distinguishing the pronunciation correlations between different written texts is a significant issue in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to acquire these pronunciation correlations automatically, called automatic text pronunciation correlation (ATPC). The supervision required for this method is the same as that needed for training end-to-end automatic speech recognition (E2E-ASR) systems, i.e., speech and the corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm is employed to align the speech with its corresponding annotated text symbols. Then, a speech encoder is used to convert the speech into speech embeddings. Finally, we compare the speech-embedding distances of different text symbols to obtain ATPC. Experimental results on Mandarin show that ATPC enhances E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking manually designed pronunciation lexicons.

Paper Structure

This paper contains 15 sections, 1 equation, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: The overall diagram of generating ATPC, where c1, c2, and c3 represent multiple embeddings corresponding to the same character in the training dataset.
  • Figure 2: Pronunciation correlation calculation with DTW. V and W represent the speech embeddings of two different Mandarin characters. $D_{norm}$ is the pronunciation correlation between V and W. The alignment path is obtained by tracing backward through the DTW table, at each step choosing the preceding point with the lowest cumulative distance.
  • Figure 3: Visual analysis of a subset of the generated ATPC matrix.
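
The DTW step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The Euclidean frame-level cost, the three-way step pattern, and the normalization of the cumulative distance by the alignment-path length are assumptions made here for illustration, since the caption does not specify them. The function name `dtw_normalized_distance` is hypothetical.

```python
import numpy as np

def dtw_normalized_distance(V, W):
    """Return (d_norm, path) for two embedding sequences of shape (T, dim).

    Assumptions (not taken from the paper): Euclidean local cost and
    normalization of the total cost by the alignment-path length.
    """
    V = np.asarray(V, dtype=float)
    W = np.asarray(W, dtype=float)
    n, m = len(V), len(W)

    # Pairwise frame-level distances between every frame of V and W.
    cost = np.linalg.norm(V[:, None, :] - W[None, :, :], axis=-1)

    # Cumulative-distance table with an infinite border row and column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Trace backward from the end, always moving to the predecessor
    # with the lowest cumulative distance.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda p: acc[p])
        path.append((i - 1, j - 1))
    path.reverse()

    return acc[n, m] / len(path), path
```

Under these assumptions, a smaller normalized distance indicates that two characters sound more alike. Calling the function on every pair of characters would fill a symmetric matrix of pairwise distances.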