BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM
Xun Gong, Anqi Lv, Zhiming Wang, Huijia Zhu, Yanmin Qian
TL;DR
BR-ASR addresses the scalability bottleneck in contextual ASR by introducing a cross-modal, contrastive speech-bias retrieval framework and a dynamic curriculum that mitigates homophone confusion. It uses frozen speech encoders and two bias encoding strategies (AcousticBias and TextualBias) to align large-scale bias entries with speech representations, and fast FAISS-based retrieval takes about 20 ms per query at a bias list size of 200k. Experiments on LibriSpeech with Rare5k show state-of-the-art B-WER (2.8%/7.1% on test-clean/test-other) with N=2000 bias entries, while scalability tests show only modest absolute degradation (0.3% WER, 2.9% B-WER) when the list grows to 200k entries at a 99.99% pruning rate. BR-ASR is shown to generalize across different SpeechLLMs and requires no architectural changes to the downstream ASR system, which makes deployment in industrial settings practical.
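The retrieval-and-pruning step summarized above can be illustrated with a minimal sketch. It assumes that speech and bias entries are already embedded in a shared space, and it uses brute-force cosine similarity in NumPy as a stand-in for the FAISS index the paper uses. The function names `retrieve_top_k` and `pruning_rate` are hypothetical and not taken from the paper. Keeping 20 of 200k entries is simply the arithmetic implied by a 99.99% pruning rate.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale vectors to unit length so that inner product equals cosine similarity.
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def retrieve_top_k(speech_emb, bias_embs, k):
    """Return indices and scores of the k bias entries most similar to the query.

    speech_emb: (D,) query embedding (assumed output of a speech encoder).
    bias_embs:  (N, D) embeddings of the full bias list.
    """
    q = l2_normalize(np.asarray(speech_emb, dtype=np.float32))
    b = l2_normalize(np.asarray(bias_embs, dtype=np.float32))
    scores = b @ q                               # (N,) cosine similarities
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k candidates
    top = top[np.argsort(-scores[top])]          # order by descending score
    return top, scores[top]

def pruning_rate(n_total, k):
    # Fraction of the bias list discarded before it reaches the ASR system.
    return 1.0 - k / n_total

# Toy usage with random embeddings (illustrative data only).
rng = np.random.default_rng(0)
bias = rng.normal(size=(1000, 16))
query = bias[42] + 0.01 * rng.normal(size=16)    # query close to entry 42
indices, scores = retrieve_top_k(query, bias, k=5)
```

Only the retained candidates are passed downstream, so the cost seen by the ASR system depends on `k` rather than on the size of the full bias list.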
Abstract
While speech large language models (SpeechLLMs) have advanced standard automatic speech recognition (ASR), contextual biasing for named entities and rare words remains challenging, especially at scale. To address this, we propose BR-ASR: a Bias Retrieval framework for large-scale contextual biasing (up to 200k entries) via two innovations: (1) speech-and-bias contrastive learning to retrieve semantically relevant candidates; (2) dynamic curriculum learning that mitigates homophone confusion, which otherwise degrades final performance. BR-ASR is a general framework that allows seamless integration of the retrieved candidates into diverse ASR systems without fine-tuning. Experiments on LibriSpeech test-clean/-other achieve state-of-the-art (SOTA) biased word error rates (B-WER) of 2.8%/7.1% with 2000 bias words, a 45% relative improvement over prior methods. BR-ASR also demonstrates high scalability: when the bias list is expanded to 200k entries, where traditional methods generally fail, it incurs only 0.3% / 2.9% absolute WER / B-WER degradation with a 99.99% pruning rate and only 20 ms latency per query on test-other.
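As a rough illustration of speech-and-bias contrastive learning, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired speech and bias embeddings. This is an assumption made for illustration: the abstract does not specify the exact loss, and the temperature value and the function name `contrastive_loss` are hypothetical. The dynamic curriculum for homophone negatives is not modeled here.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(speech_embs, bias_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of each matrix forms a positive pair.

    speech_embs, bias_embs: (B, D) arrays. All other rows in the batch act
    as negatives for a given pair.
    """
    s = speech_embs / np.linalg.norm(speech_embs, axis=1, keepdims=True)
    b = bias_embs / np.linalg.norm(bias_embs, axis=1, keepdims=True)
    logits = (s @ b.T) / temperature             # (B, B) scaled cosine similarities
    diag = np.arange(logits.shape[0])
    # Speech-to-bias direction: each speech row should pick its own bias column.
    loss_s2b = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Bias-to-speech direction: each bias column should pick its own speech row.
    loss_b2s = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_s2b + loss_b2s)

# Toy check: aligned pairs should score a lower loss than mismatched pairs.
rng = np.random.default_rng(0)
bias_batch = rng.normal(size=(8, 32))
speech_aligned = bias_batch + 0.01 * rng.normal(size=(8, 32))
speech_mismatched = np.roll(speech_aligned, 1, axis=0)
loss_aligned = contrastive_loss(speech_aligned, bias_batch)
loss_mismatched = contrastive_loss(speech_mismatched, bias_batch)
```

Minimizing a loss of this kind pulls each speech embedding toward its matching bias entry and away from the others in the batch, which is the property that nearest-neighbor retrieval at inference time relies on.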
