Peeking Into The Future For Contextual Biasing
Ramaneswaran Selvakumar, Cindy Tseng, Eesung Kim, Vijendra Raj Apsingekar, Yun Tang
TL;DR
End-to-end ASR models struggle with rare or unseen named entities, which are critical for downstream tasks. The authors introduce a simple contextual biasing approach that predicts multiple future tokens and uses those predictions to score candidates from a dynamic bias list. The approach avoids extra entity encoders and cross-attention layers and integrates the biased entities into a unified decoding space. Training optimizes both multi-token prediction and entity-scoring losses; inference combines the static vocabulary with biased entities using gating and pruning. On LibriSpeech, the method reduces named-entity (biased) WER by up to 50.34% relative to the baseline AED model, with minimal impact on overall WER, a practical gain for applications that depend on correctly recognized named entities.
Abstract
While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications such as virtual assistants. In this paper, we propose a contextual biasing method for attention-based encoder-decoder (AED) models that uses a list of candidate named entities. Instead of predicting only the next token, the model simultaneously predicts multiple future tokens, which lets it "peek into the future" and score each candidate in the entity list. Moreover, our approach uses the multi-token prediction logits directly, without additional entity encoders or cross-attention layers, which significantly reduces architectural complexity. Experiments on LibriSpeech demonstrate that our approach achieves up to 50.34% relative improvement in named-entity word error rate (WER) compared to the baseline AED model.
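To make the scoring idea concrete, the following is a minimal sketch of how multi-token prediction logits could be used to rank entities in a bias list. It is an illustration under stated assumptions, not the scoring function from the paper: the function names (`log_softmax`, `score_entities`), the length-normalized sum of log-probabilities, the toy 5-token vocabulary, and the three prediction heads are all hypothetical. The gating and pruning steps are not modeled.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def score_entities(future_logits, entities):
    """Score each candidate entity against K future-token predictions.

    future_logits: list of K logit vectors; vector i is the (assumed)
        prediction for the token i steps ahead of the current position.
    entities: dict mapping an entity name to its token-id sequence.
    Returns a dict mapping each name to a length-normalized log-probability.
    Assumption: tokens beyond the K predicted positions are ignored.
    """
    log_probs = [log_softmax(row) for row in future_logits]
    scores = {}
    for name, token_ids in entities.items():
        n = min(len(token_ids), len(log_probs))
        if n == 0:
            continue  # nothing to score for an empty entity
        total = sum(log_probs[i][token_ids[i]] for i in range(n))
        scores[name] = total / n
    return scores


# Toy example: vocabulary of 5 tokens, K = 3 future positions.
future_logits = [
    [0.1, 2.0, 0.0, 0.0, 0.0],  # next token: id 1 is most likely
    [0.0, 0.0, 3.0, 0.0, 0.0],  # one step further: id 2
    [0.0, 0.0, 0.0, 0.5, 2.5],  # two steps further: id 4
]
entities = {"entity_a": [1, 2, 4], "entity_b": [0, 3]}

scores = score_entities(future_logits, entities)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setup, `entity_a` ranks first because its tokens match the most likely token at every predicted position. No entity encoder or cross-attention layer is involved; the scores come only from the per-position logits, which is the property the abstract highlights.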
