Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation

Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey

TL;DR

This work addresses the challenge of recognizing rare words and named entities in end-to-end ASR by strengthening neural contextual biasing. It introduces two techniques: (i) injecting biasing contexts into intermediate encoder layers to propagate contextual information through self-attention, and (ii) training-time text perturbation using alternative spellings to force the model to rely on the provided context. Implemented on a transducer with a Zipformer encoder and cross-attention biasing adapters, the approach achieves state-of-the-art results on LibriSpeech contextual ASR and shows solid gains on SPGISpeech and ConEC, while keeping decoding overhead practical. The findings demonstrate that simple, scalable context incorporation and perturbation-based training can substantially improve rare-word recognition in contextual ASR, with broad implications for personalized or domain-specific speech systems.

Abstract

Existing research suggests that automatic speech recognition (ASR) models can benefit from additional contexts (e.g., contact lists, user-specified vocabulary), which help them recognize rare words and named entities. In this work, we propose two simple yet effective techniques to improve context-aware ASR models. First, we inject contexts into the encoders at an early stage instead of only at their last layers. Second, to force the model to leverage the contexts during training, we perturb the reference transcription with alternative spellings so that the model learns to rely on the contexts to make correct predictions. On LibriSpeech, our techniques together reduce the rare word error rate by 60% relative to no biasing and by 25% relative to shallow fusion, setting a new state of the art. On SPGISpeech and ConEC, a real-world dataset, our techniques also yield good improvements over the baselines.
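The text perturbation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The lookup table `ALT_SPELLINGS`, the function `perturb_transcript`, and the choice to place the perturbed spelling in the biasing list are all assumptions made for illustration; the abstract only states that the reference transcription is perturbed with alternative spellings so that the model must consult the context.

```python
import random

# Hypothetical table of alternative spellings (an assumption for this sketch).
# How such alternatives are actually generated is not specified here.
ALT_SPELLINGS = {
    "jon": ["john"],
    "kaitlin": ["caitlyn", "katelyn"],
}


def perturb_transcript(words, rare_words, alt_spellings, prob, rng):
    """Swap rare words for alternative spellings with probability `prob`.

    Returns the perturbed word list and a biasing list holding the spelling
    that ended up in the reference, so the correct spelling can only be
    recovered from the context (assumed setup, not confirmed by the source).
    """
    perturbed = []
    biasing_list = []
    for word in words:
        if word in rare_words:
            alternatives = alt_spellings.get(word)
            if alternatives and rng.random() < prob:
                word = rng.choice(alternatives)
            biasing_list.append(word)
        perturbed.append(word)
    return perturbed, biasing_list


if __name__ == "__main__":
    rng = random.Random(0)
    reference = ["call", "jon", "now"]
    print(perturb_transcript(reference, {"jon"}, ALT_SPELLINGS, 1.0, rng))
```

Because the acoustics no longer determine the spelling on their own, a model trained this way has an incentive to copy the spelling from the biasing list rather than from what it memorized.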

Paper Structure

This paper contains 13 sections, 6 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: The transducer model (left) and its contextual biasing module (right). The red/blue dots mark the locations where contexts can be injected into the main model by plugging in the contextual biasing module. During training, the gray modules can be frozen while only the yellow modules are trained.
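
The injection points described in the Figure 1 caption can be sketched in plain Python under strong simplifying assumptions. Here the biasing module is reduced to single-head scaled dot-product cross-attention with no learned projections, added back to the frames as a residual. The names `bias_frames`, `encode`, and `inject_at` are hypothetical, and the encoder layers are stand-in callables rather than Zipformer blocks. This is an illustration of early versus last-layer injection, not the paper's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bias_frames(frames, contexts):
    """Cross-attend each frame (query) over context embeddings (keys/values).

    Simplification: no learned query/key/value projections, one head.
    The attended context vector is added to the frame as a residual.
    """
    dim = len(frames[0])
    biased = []
    for query in frames:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in contexts
        ]
        weights = softmax(scores)
        attended = [
            sum(w * ctx[i] for w, ctx in zip(weights, contexts))
            for i in range(dim)
        ]
        biased.append([q + a for q, a in zip(query, attended)])
    return biased


def encode(frames, layers, contexts, inject_at):
    """Run encoder layers, applying biasing after each layer index in `inject_at`.

    Injecting only at the last index mimics last-layer biasing; adding earlier
    indices mimics early context injection.
    """
    hidden = frames
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in inject_at:
            hidden = bias_frames(hidden, contexts)
    return hidden
```

With two identity layers, `encode(frames, [f, f], contexts, {1})` biases once at the end, while `inject_at={0, 1}` also biases after the first layer, so later layers operate on frames that already carry context information.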