End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients

Kesavaraj V; Anuprabha M; Anil Kumar Vuppala

TL;DR

This work tackles user-defined keyword spotting (UDKWS) by addressing the limitations of short-term spectral features in distinguishing closely related pronunciations. It introduces Shifted Delta Coefficients (SDC), computed from Mel-spectrograms, to capture long-term temporal dynamics and integrates them into an end-to-end cross-attention architecture that aligns audio and text representations for keyword validation. Extensive experiments across LibriPhrase, Google Commands, and Qualcomm datasets show that SDC outperforms MFCC and other baselines, with the best configuration $40$-$1$-$3$-$8$ yielding notable gains in AUC and EER, and competitive results against state-of-the-art UDKWS methods. These findings demonstrate the value of incorporating long-range temporal context for personalized wake-word systems, with potential for hybrid feature approaches in future work.
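As a rough illustration of how SDC features stack deltas from future frames, here is a minimal NumPy sketch. It assumes the configuration $40$-$1$-$3$-$8$ follows the conventional $N$-$d$-$P$-$k$ ordering (feature dimension, delta spread, block shift, number of blocks), which this summary does not spell out. The function name, the edge padding at utterance boundaries, and the random stand-in for a Mel-spectrogram are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def shifted_delta_coefficients(feats, d=1, P=3, k=8):
    """Stack k delta blocks, each shifted by P frames (hypothetical helper).

    feats: array of shape (T, N), one N-dimensional feature vector per frame.
    Returns an array of shape (T, N * k).
    """
    feats = np.asarray(feats, dtype=float)
    T, _ = feats.shape
    # Assumption: repeat edge frames so every shifted index stays in range.
    pad_left = d
    pad_right = (k - 1) * P + d
    padded = np.pad(feats, ((pad_left, pad_right), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        # Block i at frame t is c(t + i*P + d) - c(t + i*P - d).
        start_ahead = pad_left + i * P + d
        start_behind = pad_left + i * P - d
        ahead = padded[start_ahead:start_ahead + T]
        behind = padded[start_behind:start_behind + T]
        blocks.append(ahead - behind)
    return np.concatenate(blocks, axis=1)

# Placeholder input: 100 frames of a 40-bin "Mel-spectrogram" (random values).
rng = np.random.default_rng(0)
mel = rng.standard_normal((100, 40))
sdc = shifted_delta_coefficients(mel, d=1, P=3, k=8)
print(sdc.shape)  # (100, 320)
```

Under this reading, each frame's output covers deltas taken up to $(k-1)P + d = 22$ frames ahead, which is the sense in which the feature carries longer temporal context than a single delta.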

Abstract

Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches to user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel-frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features can struggle to distinguish audio-text pairs with closely related pronunciations, because they capture the temporal dynamics of the speech signal only to a limited extent. To address this challenge, we propose to use shifted delta coefficients (SDC), which help capture pronunciation variability (transitions between connecting phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various configurations of SDC are explored to find a suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline feature, with an improvement of 8.32% in area under the curve (AUC) and 8.69% in equal error rate (EER) on the challenging LibriPhrase-hard dataset. Moreover, the proposed approach demonstrated superior performance compared to state-of-the-art UDKWS techniques.
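For readers less familiar with the two metrics quoted above, the following sketch shows one common way to compute AUC and EER from match scores and binary labels. It is an illustration only: the function name is hypothetical, the scores are assumed to be distinct (ties are not handled), and nothing here is taken from the authors' evaluation code.

```python
import numpy as np

def auc_and_eer(scores, labels):
    """Return (AUC, EER) for scores where higher means 'keyword matches'.

    labels: 1 for a matching audio-text pair, 0 for a non-matching pair.
    Assumption: scores are distinct, so each threshold moves one sample.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="mergesort")  # descending by score
    labels_sorted = labels[order]
    n_pos = labels_sorted.sum()
    n_neg = len(labels_sorted) - n_pos
    # ROC points obtained by lowering the threshold one sample at a time.
    tpr = np.concatenate([[0.0], np.cumsum(labels_sorted) / n_pos])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels_sorted) / n_neg])
    # Trapezoidal area under the ROC curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    # EER: operating point where false-negative and false-positive rates meet.
    fnr = 1.0 - tpr
    idx = int(np.argmin(np.abs(fnr - fpr)))
    eer = float((fnr[idx] + fpr[idx]) / 2.0)
    return auc, eer

# Toy example with four audio-text pairs.
auc, eer = auc_and_eer([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc, eer)  # 0.75 0.5
```

A higher AUC and a lower EER both indicate better separation between matching and non-matching pairs.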

Paper Structure

This paper contains 21 sections, 2 equations, 2 figures, 3 tables.

Introduction
Architecture
Audio Encoder
Text Encoder
Pattern Extractor
Pattern Discriminator
Feature Extraction
Mel Spectrogram
Mel-Frequency Cepstral Coefficients
Perceptual Linear Prediction
Relative Spectral - Perceptual Linear Prediction
Shifted Delta Coefficients
Experimental Setup
Database
Implementation Details
...and 6 more sections

Figures (2)

Figure 1: Proposed architecture for user-defined keyword spotting
Figure 2: Performance Analysis of SDC Configuration across four datasets. (a) & (b) illustrate the effect of varying d values. (c) & (d) illustrate the effect of varying k values.
