End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients
Kesavaraj V, Anuprabha M, Anil Kumar Vuppala
TL;DR
This work tackles user-defined keyword spotting (UDKWS) by addressing the limited ability of short-term spectral features to distinguish closely related pronunciations. It introduces Shifted Delta Coefficients (SDC), computed from Mel-spectrograms, to capture long-term temporal dynamics, and integrates them into an end-to-end cross-attention architecture that aligns audio and text representations for keyword validation. Experiments on the LibriPhrase, Google Commands, and Qualcomm datasets show that SDC outperforms MFCC and other baseline features. The best configuration, $40$-$1$-$3$-$8$, improves AUC by 8.32% and EER by 8.69% over MFCC on LibriPhrase-hard, and the system is competitive with state-of-the-art UDKWS methods. These findings demonstrate the value of long-range temporal context for personalized wake-word systems, and suggest hybrid feature approaches as future work.
Abstract
Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches to user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as Mel-frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features can struggle to distinguish audio-text pairs with closely related pronunciations, because they capture the temporal dynamics of the speech signal only to a limited extent. To address this challenge, we propose to use shifted delta coefficients (SDC), which capture pronunciation variability (the transitions between adjacent phonemes) by incorporating long-term temporal information. The SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various SDC configurations are explored to find a suitable temporal context for the UDKWS task. The experimental results show that the SDC feature outperforms the MFCC baseline feature, with an improvement of 8.32% in area under the curve (AUC) and 8.69% in equal error rate (EER) on the challenging LibriPhrase-hard dataset. Moreover, the proposed approach outperforms state-of-the-art UDKWS techniques.
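To illustrate how SDC adds long-term context, the following is a minimal sketch of a shifted delta computation. It assumes the commonly used N-d-P-k convention, in which N is the number of base coefficients per frame, d is the delta spread, P is the shift between delta blocks, and k is the number of blocks stacked per frame. The function name `shifted_delta_coefficients`, the edge padding at the boundaries, and the random stand-in for a Mel-spectrogram are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def shifted_delta_coefficients(features, d=1, p=3, k=8):
    """Stack k shifted delta blocks for each frame (hypothetical helper).

    features: array of shape (T, N), e.g. T frames of N Mel bands.
    Returns an array of shape (T, N * k), where block i at frame t is
    features[t + i*p + d] - features[t + i*p - d].
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]

    # Assumption: repeat the edge frames so every index t + i*p +/- d is valid.
    pad_before = d
    pad_after = (k - 1) * p + d
    padded = np.pad(features, ((pad_before, pad_after), (0, 0)), mode="edge")

    blocks = []
    for i in range(k):
        center = pad_before + i * p  # position of frame (0 + i*p) in `padded`
        ahead = padded[center + d : center + d + num_frames]
        behind = padded[center - d : center - d + num_frames]
        blocks.append(ahead - behind)

    # Concatenate the k delta blocks along the feature axis.
    return np.concatenate(blocks, axis=1)


# Stand-in for a 40-band Mel-spectrogram with 100 frames (random data).
mel = np.random.default_rng(0).standard_normal((100, 40))
sdc = shifted_delta_coefficients(mel, d=1, p=3, k=8)
print(sdc.shape)  # (100, 320)
```

Under this reading, the 40-1-3-8 configuration yields a 320-dimensional vector per frame, and each vector spans frames from t - 1 to t + 22, which is the sense in which SDC sees a longer temporal window than a single short-term frame.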
