Generative Annotation for ASR Named Entity Correction

Yuanchang Luo, Daimeng Wei, Shaojun Li, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Hao Yang

TL;DR

This work tackles named-entity correction in end-to-end ASR by introducing a generative annotation approach that uses speech-sound features to retrieve candidate entities and a generative model to annotate and replace erroneous text. By replacing phonetic-based retrieval with audio-based candidate retrieval and coupling it with an end-to-end generative correction step, the method handles cases where word forms differ substantially from the ground-truth entities. Across AISHELL and a self-constructed BuzzWord set, the approach outperforms a strong phonetic-edit-distance baseline, achieving lower CER and higher NE-Recall, and remains effective with commercial ASR systems. The method also integrates entity rejection and demonstrates meaningful insights via attention analyses, offering a practical, generalizable solution for robust domain-specific entity transcription in ASR, with publicly available training data and code.

Abstract

End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performance. However, when the forms of the wrongly transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in the hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios where word forms differ. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring significant improvement to entity accuracy. The self-constructed training data and test set are publicly available at github.com/L6-NLP/Generative-Annotation-NEC.
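
The final replacement step described in the abstract can be illustrated with a minimal sketch. It assumes the annotation model has already returned the span of the transcript it considers wrong, and that an empty span means the candidate entity was rejected. The function name `apply_annotation`, the rejection convention, and the example sentence are hypothetical and are not taken from the paper or its repository; only the pair "米德仲尼" / "Midjourney" comes from the paper's Figure 4 caption.

```python
from typing import Optional


def apply_annotation(transcript: str, candidate: str,
                     annotated_span: Optional[str]) -> str:
    """Replace the span flagged by the annotation model with the candidate entity.

    Assumption (not from the paper): an empty or missing span, or a span that
    does not occur in the transcript, is treated as a rejection of the
    candidate, and the transcript is returned unchanged.
    """
    if not annotated_span or annotated_span not in transcript:
        return transcript
    # Replace only the first occurrence; how repeats are handled is left open.
    return transcript.replace(annotated_span, candidate, 1)


# Made-up hypothesis sentence containing the mistranscribed entity.
hypothesis = "我用米德仲尼生成图片"

print(apply_annotation(hypothesis, "Midjourney", "米德仲尼"))  # entity corrected
print(apply_annotation(hypothesis, "Midjourney", ""))          # candidate rejected
```

Because the span comes from the annotation model and is not found by phonetic matching against the candidate, the replacement works even when the wrong text and the entity look nothing alike, which is the case the abstract targets.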

Paper Structure

This paper contains 26 sections, 6 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The drawback of NEC methods based on phonetic-level similarity algorithms in scenarios when the word form of the ground-truth entity is greatly different from that of the to-be-corrected text.
  • Figure 2: Our method consists of two steps. The left part (SS) denotes datastore construction and candidate entity retrieval. The right part (GL) denotes concatenating the candidate entities and the ASR transcript into a prompt that guides the model to generate the erroneous text in the transcript. Finally, error correction is done by text replacement.
  • Figure 3: Constructing generative labeling training data using speech with ground-truth transcript.
  • Figure 4: Heatmaps of Cross Attention in the last layer and Self Attention in each layer of our generative annotation model. Regarding Self Attention, we analyze the relationship between the output result "米德仲尼" and the prompt. The candidate entity is "Midjourney," the incorrectly transcribed text is "米德仲尼", and the annotation result is "米德仲尼".
  • Figure 5: Error Correction CER at different retrieval thresholds.
  • ...and 2 more figures