CopyNE: Better Contextual ASR by Copying Named Entities

Shilin Zhou, Zhenghua Li, Yu Hong, Min Zhang, Zhefeng Wang, Baoxing Huai

TL;DR

CopyNE tackles the challenge of transcribing contextual named entities (NEs) in ASR by treating entities as indivisible units and copying them directly from a contextual NE dictionary. It introduces a dedicated NE encoder, a copy mechanism, and a copy loss that guides correct entity copying, and it integrates these with a CTC-Transformer backbone so that each decoding step can either generate a single token or copy an entire entity. Across Chinese and English datasets, CopyNE achieves substantial NE-CER and WER improvements, with the most pronounced gains in NE transcription accuracy, and it retains these benefits when built on strong pre-trained models such as Whisper. The approach offers practical value for downstream NLP tasks and shows robustness to dictionary noise, while outlining avenues for dynamic dictionary filtering and broader applicability.

Abstract

End-to-end automatic speech recognition (ASR) systems have made significant progress in general scenarios. However, it remains challenging to transcribe contextual named entities (NEs) in the contextual ASR scenario. Previous approaches have attempted to address this by utilizing the NE dictionary. These approaches treat entities as individual tokens and generate them token-by-token, which may result in incomplete transcriptions of entities. In this paper, we treat entities as indivisible wholes and introduce the idea of copying into ASR. We design a systematic mechanism called CopyNE, which can copy entities from the NE dictionary. By copying all tokens of an entity at once, we can reduce errors during entity transcription, ensuring the completeness of the entity. Experiments demonstrate that CopyNE consistently improves the accuracy of transcribing entities compared to previous approaches. Even when based on the strong Whisper, CopyNE still achieves notable improvements.
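
The difference between token-by-token generation and copying a whole entity can be illustrated with a toy decoding step. This is only a minimal sketch, not the authors' implementation: the function `decode_step`, the score dictionaries, the example entity, and the way the threshold `gamma` is applied are all assumptions made for illustration (the figure list names a confidence threshold $\gamma$, but how it is used is not described here).

```python
from typing import Dict, List


def decode_step(
    token_scores: Dict[str, float],
    entity_scores: Dict[str, float],
    entity_tokens: Dict[str, List[str]],
    gamma: float = 0.9,
) -> List[str]:
    """Return the tokens emitted at one (hypothetical) decoding step.

    token_scores:  score of generating each vocabulary token (assumed given).
    entity_scores: score of copying each dictionary entity (assumed given).
    entity_tokens: the token sequence of each dictionary entity.
    gamma:         assumed confidence threshold for accepting a copy.
    """
    best_token = max(token_scores, key=token_scores.get)
    if entity_scores:
        best_entity = max(entity_scores, key=entity_scores.get)
        score = entity_scores[best_entity]
        # Copy only when the entity is confident enough and outscores
        # the best single token; all of its tokens are emitted at once.
        if score >= gamma and score > token_scores[best_token]:
            return list(entity_tokens[best_entity])
    # Otherwise fall back to ordinary single-token generation.
    return [best_token]


# Hypothetical dictionary with one two-token entity.
ne_dict = {"Acme Corp": ["Acme", "Corp"]}

copied = decode_step({"the": 0.6, "a": 0.4}, {"Acme Corp": 0.95}, ne_dict)
generated = decode_step({"the": 0.6, "a": 0.4}, {"Acme Corp": 0.30}, ne_dict)
```

In the first call the entity clears the threshold, so both of its tokens are emitted together and the entity cannot be left incomplete; in the second call the step falls back to generating one token.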

Paper Structure (31 sections, 15 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: An example with homophonic errors. Pinyin is the Mandarin pronunciation of each token. The red text indicates the wrongly predicted token.
  • Figure 2: The CTC-Transformer model.
  • Figure 3: Our CopyNE model.
  • Figure 4: Effect of the Confidence Threshold $\gamma$.
  • Figure 5: Effect of $\beta$.