H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems
Huangyu Dai, Lingtao Mao, Ben Chen, Zihan Wang, Zihan Liang, Ying Han, Chenyi Lei, Han Li
TL;DR
The paper tackles the challenge of hotword recognition when large hotword inventories are used in ASR. It introduces H-PRM, a pluggable hotword pre-retrieval module that relies on phonemic embeddings and a CNN-based cosine similarity score to pre-rank hotwords and extract top-N candidates for integration into both traditional ASR systems (e.g., SeACo-Paraformer) and Audio LLM workflows via prompts. A novel training trick with iterative hard-sample mining enhances discriminative power, and comprehensive experiments across multiple datasets demonstrate substantial improvements in post-recall rate (PRR) and reductions in MER, even as hotword lists scale to thousands. The results show that phoneme-level cross-modal matching is more effective than audio- or text-based alternatives, enabling robust hotword customization with a plug-and-play design that benefits diverse ASR architectures and prompting strategies. These findings offer a practical, scalable path for accurate hotword adoption in real-world ASR systems and Audio LLMs, with potential for further integration into LLM-based contextual ASR.
Abstract
Hotword customization is crucial in ASR for improving the recognition accuracy of domain-specific terms. Progress has been driven primarily by advances in traditional models and Audio large language models (LLMs). However, existing models often struggle with large-scale hotword lists: the recognition rate drops dramatically as the number of hotwords increases. In this paper, we introduce a novel hotword customization system that uses a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidates by measuring the acoustic similarity between the hotwords and the speech segment. This plug-and-play solution can be easily integrated into traditional models such as SeACo-Paraformer, significantly improving the hotword post-recall rate (PRR). Additionally, we incorporate H-PRM into Audio LLMs through a prompt-based approach, enabling seamless hotword customization. Extensive experiments show that H-PRM can outperform existing methods, pointing to a new direction for hotword customization in ASR.
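The pre-retrieval idea can be illustrated with a minimal sketch. It assumes that the speech segment and each hotword have already been mapped to fixed-size embedding vectors; the actual H-PRM computes its score with a CNN over phonemic embeddings, which is not reproduced here. The function names, the toy vectors, and the prompt wording are hypothetical and are not taken from the paper.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_n(
    speech_emb: Sequence[float],
    hotword_embs: Dict[str, Sequence[float]],
    n: int,
) -> List[Tuple[str, float]]:
    """Score every hotword against the speech embedding and keep the n best."""
    scored = (
        (word, cosine_similarity(speech_emb, emb))
        for word, emb in hotword_embs.items()
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


def build_prompt(candidates: List[str]) -> str:
    """Hypothetical prompt template for passing candidates to an Audio LLM."""
    if not candidates:
        return "Transcribe the audio."
    return "Transcribe the audio. Possible hotwords: " + ", ".join(candidates) + "."


# Toy data (made up for illustration): three hotwords with 3-dimensional embeddings.
hotwords = {
    "paraformer": [0.9, 0.1, 1.0],
    "transformer": [0.0, 1.0, 0.0],
    "performer": [1.0, 0.0, 0.2],
}
speech = [1.0, 0.0, 1.0]

top = retrieve_top_n(speech, hotwords, n=2)
prompt = build_prompt([word for word, _ in top])
```

Only the top-N candidates reach the downstream recognizer or prompt, so the list it must handle stays short even when the full hotword inventory contains thousands of entries.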
