Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment
Li Kewei, Zhou Hengshun, Shen Kai, Dai Yusheng, Du Jun
TL;DR
The paper tackles two weaknesses of open-vocabulary keyword spotting (KWS): false alarms on confusable words and inflexible enrollment. It introduces Phoneme-Level Contrastive Learning (PLCL), which aligns phonemes across audio-text and audio-audio pairs in a shared embedding space using InfoNCE-based losses. A context-agnostic phoneme memory bank supplies confusable negatives for data augmentation, and a third-category discriminator is trained to distinguish these hard negatives. The verifier supports text, audio, or combined audio-text enrollment, with modality-specific encoders and losses that jointly optimize cross-modal phoneme alignment and utterance scoring. On LibriPhrase, PLCL achieves state-of-the-art performance; text+audio enrollment gives the best results, and ablations confirm that phoneme-level alignment and hard-negative augmentation are critical for robust KWS.
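To make the InfoNCE-based alignment concrete, here is a minimal NumPy sketch of a phoneme-level contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the function name `phoneme_info_nce`, the one-directional (audio-to-text) form, cosine similarity, and the temperature of 0.1 are all hypothetical choices. Row i of each matrix is assumed to hold the embedding of the same phoneme in the two modalities; every other row serves as a negative.

```python
import numpy as np

def phoneme_info_nce(audio_emb, text_emb, temperature=0.1):
    """InfoNCE loss where row i of audio_emb is the positive for row i of text_emb.

    Hypothetical helper for illustration; the paper's exact loss may differ.
    """
    # L2-normalise so that the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature
    # Subtract the row-wise max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching phoneme pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy data (assumed): six phonemes with 16-dimensional embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(6, 16))
audio = text + 0.05 * rng.normal(size=(6, 16))  # audio close to its text phoneme

aligned_loss = phoneme_info_nce(audio, text)
misaligned_loss = phoneme_info_nce(audio, np.roll(text, 1, axis=0))
print(aligned_loss, misaligned_loss)
```

On this toy data the loss for correctly paired phonemes should be well below the loss obtained after shifting the text rows by one position, which is the behavior a phoneme-level alignment objective is meant to reward. The same function could be applied to audio-audio pairs by passing two audio embedding matrices.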
Abstract
User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods suffer from high false alarm rates on confusable words and are limited to either audio-only or text-only enrollment. In this paper, we first address the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons that yield more accurate alignment, and it generalizes to jointly optimizing audio-text and audio-audio matching, thereby adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Building on these negatives, a third-category discriminator is designed specifically to distinguish hard negatives. Overall, we develop a robust and flexible KWS system that supports enrollment in different modalities within a unified framework. Evaluated on the LibriPhrase dataset, the proposed approach achieves state-of-the-art performance.
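The abstract does not say how the memory bank builds confusable negatives, so the following is only one plausible sketch. It assumes that a negative is formed by replacing a single phoneme of the keyword with a different phoneme whose embedding is drawn from the bank. The class `PhonemeBank`, the function `make_confusable_negative`, the per-phoneme capacity, and the random substitution rule are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

class PhonemeBank:
    """Hypothetical bank: stores recent embeddings per phoneme, ignoring context."""

    def __init__(self, max_per_phoneme=8):
        self.max_per_phoneme = max_per_phoneme
        self.store = defaultdict(list)

    def add(self, phoneme, embedding):
        slot = self.store[phoneme]
        slot.append(embedding)
        if len(slot) > self.max_per_phoneme:
            slot.pop(0)  # drop the oldest entry

    def sample(self, phoneme, rng):
        return rng.choice(self.store[phoneme])

def make_confusable_negative(phonemes, embeddings, bank, rng):
    """Swap one phoneme (and its embedding) for a different one from the bank."""
    pos = rng.randrange(len(phonemes))
    candidates = [p for p, embs in bank.store.items()
                  if p != phonemes[pos] and embs]
    if not candidates:
        raise ValueError("bank holds no alternative phoneme")
    new_phoneme = rng.choice(candidates)
    neg_phonemes = list(phonemes)
    neg_embeddings = list(embeddings)
    neg_phonemes[pos] = new_phoneme
    neg_embeddings[pos] = bank.sample(new_phoneme, rng)
    return neg_phonemes, neg_embeddings, pos

# Toy usage with made-up 2-dimensional embeddings.
rng = random.Random(0)
bank = PhonemeBank()
for i, ph in enumerate(["K", "AE", "T", "B", "P"]):
    bank.add(ph, [float(i), float(i) + 0.5])

keyword = ["K", "AE", "T"]
keyword_emb = [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5]]
neg_ph, neg_emb, pos = make_confusable_negative(keyword, keyword_emb, bank, rng)
print(keyword, "->", neg_ph)
```

A negative built this way differs from the keyword in exactly one phoneme, which makes it acoustically close to the positive and therefore hard. A practical system would likely bias the substitution toward phonetically similar phonemes rather than sampling uniformly as this sketch does.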
