ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody

Jianan Pan, Yuanming Zhang, Kejie Huang

Abstract

Current keyword spotting systems rely primarily on phoneme-level matching to distinguish confusable words, but they ignore user-specific pronunciation traits such as prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework that integrates fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder in which one stream learns robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, improving adaptability across acoustic environments. Experiments show that ProKWS performs comparably to state-of-the-art models on standard benchmarks and remains robust on personalized keywords with tone and intent variations.
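To make the idea of dynamically combining the two streams concrete, here is a minimal NumPy sketch of one common way to fuse two embeddings: a learned per-dimension gate. The abstract does not specify the form of the fusion module, so the gating formula, the function name `gated_fusion`, the parameters `w_gate` and `b_gate`, and all shapes are assumptions for illustration only, not the authors' design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(phoneme_emb, prosody_emb, w_gate, b_gate):
    """Blend two embeddings with a per-dimension gate in (0, 1).

    Hypothetical illustration: the gating form and all names are
    assumptions, not the fusion module described in the paper.
    """
    # The gate is computed from both streams concatenated.
    joint = np.concatenate([phoneme_emb, prosody_emb], axis=-1)
    gate = sigmoid(joint @ w_gate + b_gate)
    # gate near 1 favours phonemic evidence, gate near 0 favours prosody.
    return gate * phoneme_emb + (1.0 - gate) * prosody_emb

rng = np.random.default_rng(0)
dim = 8
phoneme_emb = rng.normal(size=(2, dim))         # batch of 2 utterances
prosody_emb = rng.normal(size=(2, dim))
w_gate = rng.normal(size=(2 * dim, dim)) * 0.1  # stand-in for learned weights
b_gate = np.zeros(dim)

fused = gated_fusion(phoneme_emb, prosody_emb, w_gate, b_gate)
print(fused.shape)
```

Because the gate depends on the inputs, the weighting between phonemic and prosodic evidence can change from utterance to utterance, which is one plausible reading of "dynamically combines".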

Paper Structure (15 sections, 8 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Overall architecture of the proposed ProKWS.
  • Figure 2: t-SNE visualization of prosodic signatures across different accents and intents.
  • Figure 3: Score variation analysis for continuous intent change. The x-axis represents the interpolation coefficient $\alpha$ between imperative and interrogative prosody, and the y-axis represents the resulting score $s(\alpha)$.
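The sweep described in the Figure 3 caption can be sketched as follows. The caption states only that $\alpha$ interpolates between imperative and interrogative prosody and that $s(\alpha)$ is the resulting score. The linear interpolation of embeddings, the cosine-similarity score, and the names `score_sweep`, `p_imp`, `p_int`, and `enrolled` are assumptions made for this illustration, and the toy vectors are not data from the paper.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_sweep(p_imperative, p_interrogative, enrolled, num_steps=11):
    """Score a linear blend of two prosody embeddings against a reference.

    Hypothetical illustration: both the blend and the cosine score are
    assumed here, not taken from the paper.
    """
    alphas = np.linspace(0.0, 1.0, num_steps)
    scores = []
    for alpha in alphas:
        # alpha = 0 is purely imperative, alpha = 1 purely interrogative.
        mixed = (1.0 - alpha) * p_imperative + alpha * p_interrogative
        scores.append(cosine(mixed, enrolled))
    return alphas, np.array(scores)

# Toy orthogonal embeddings; the user is assumed to have enrolled
# the keyword with imperative prosody.
p_imp = np.array([1.0, 0.0, 0.0])
p_int = np.array([0.0, 1.0, 0.0])
alphas, scores = score_sweep(p_imp, p_int, enrolled=p_imp)

for alpha, score in zip(alphas, scores):
    print(f"alpha={alpha:.1f}  s(alpha)={score:.3f}")
```

In this toy setup the score falls steadily as $\alpha$ moves from the enrolled imperative prosody toward the interrogative one, which is the kind of curve such a plot is meant to expose.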