Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment

Li Kewei, Zhou Hengshun, Shen Kai, Dai Yusheng, Du Jun

TL;DR

The paper tackles open-vocabulary keyword spotting by addressing two problems: robustness to confusable words and enrollment flexibility. It introduces Phoneme-Level Contrastive Learning (PLCL), which aligns phonemes across audio-text and audio-audio pairs in a shared embedding space using InfoNCE-based losses. A context-agnostic phoneme memory bank supplies confusable (hard) negatives for data augmentation, and a third-category discriminator is trained to distinguish those hard negatives. The verifier supports text, audio, or audio-text enrollment, with modality-specific encoders and losses that jointly optimize cross-modal phoneme alignment and utterance scoring. On LibriPhrase, PLCL achieves state-of-the-art performance, with text+audio enrollment delivering the best results and ablations confirming the critical role of phoneme-level alignment and hard-negative augmentation for robust KWS.
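To make the "InfoNCE-based losses" concrete, here is a minimal pure-Python sketch of a generic InfoNCE loss over phoneme embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `info_nce`), the cosine similarity, the temperature of 0.1, and the toy 2-D vectors are all hypothetical choices. Row `i` of the audio embeddings is treated as the positive for row `i` of the text embeddings, and every other row acts as a negative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] and targets[i] are the positive pair,
    all other targets serve as negatives for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy example (assumed data): three phoneme embeddings per modality.
audio_phonemes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Rotating the text rows breaks the phoneme correspondence.
text_shuffled = [text_aligned[1], text_aligned[2], text_aligned[0]]

aligned_loss = info_nce(audio_phonemes, text_aligned)
shuffled_loss = info_nce(audio_phonemes, text_shuffled)
```

In this toy setup the loss is lower when each audio phoneme is paired with its matching text phoneme than when the pairing is rotated, which is the behavior a phoneme-level alignment objective rewards.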

Abstract

User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods suffer from high false alarm rates on confusable words and are limited to either audio-only or text-only enrollment. Therefore, in this paper, we first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it generalizes to jointly optimizing both audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Based on this, a third-category discriminator is specifically designed to distinguish hard negatives. Overall, we develop a robust and flexible KWS system that supports enrollment in different modalities within a unified framework. Verified on the LibriPhrase dataset, the proposed approach achieves state-of-the-art performance.
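The abstract does not spell out how the memory bank builds confusable negatives, so the following is only one plausible reading, sketched in Python. It assumes the bank is a dictionary from phoneme label to a list of stored embeddings, and that a hard negative is formed by swapping a single phoneme of the keyword for a different phoneme drawn from the bank. The function name `make_confusable_negative`, the phoneme labels, and the toy vectors are hypothetical.

```python
import random

def make_confusable_negative(phonemes, bank, rng, num_swaps=1):
    """Return (negative_phonemes, embeddings).

    Assumption: a confusable negative differs from the keyword in
    `num_swaps` phoneme positions; its embeddings are looked up in a
    context-agnostic bank mapping phoneme -> list of stored vectors.
    """
    positions = rng.sample(range(len(phonemes)), num_swaps)
    negative = list(phonemes)
    for pos in positions:
        # Any bank phoneme other than the original one is a candidate.
        candidates = [p for p in bank if p != phonemes[pos]]
        negative[pos] = rng.choice(candidates)
    # One stored embedding per phoneme, chosen independently of context.
    embeddings = [rng.choice(bank[p]) for p in negative]
    return negative, embeddings

# Toy bank (assumed data): two stored vectors per phoneme.
toy_bank = {
    "IH": [[0.1, 0.2], [0.2, 0.1]],
    "N": [[0.5, 0.5], [0.4, 0.6]],
    "F": [[0.9, 0.0], [0.8, 0.1]],
    "D": [[0.0, 0.9], [0.1, 0.8]],
}
keyword = ["IH", "N", "F"]
rng = random.Random(0)
negative, negative_embeddings = make_confusable_negative(keyword, toy_bank, rng)
num_changed = sum(a != b for a, b in zip(keyword, negative))
```

Because only one phoneme changes, the resulting sequence stays close to the keyword, which is what makes it a hard negative rather than an easy one. A three-way discriminator could then be trained to label pairs as positive, easy negative, or hard negative, although the exact label scheme is not given in this section.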

Paper Structure (18 sections, 8 equations, 3 figures, 3 tables)

This paper contains 18 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overall architecture of the proposed PLCL model. The input consists of a query audio paired with either enrollment text or enrollment audio, depending on the enrollment data. The output is a score used to determine whether the query matches the enrollment word.
  • Figure 2: Visualization of attention maps. In the positive examples (a) and (d), the target keyword is "the house". In the hard negative examples (b) and (e), the target keyword is "information" while the query audio is "induration". In the easy negative examples (c) and (f), the target keyword is "the captain" while the query audio is "swear allegiance".
  • Figure 3: Visualization of t-SNE for various phonemes.