Vocal Tract Length Warped Features for Spoken Keyword Spotting

Achintya kr. Sarkar; Priyanka Dwivedi; Zheng-Hua Tan

TL;DR

This work addresses vocal tract length (VTL) variability in spoken keyword spotting (KWS) by introducing VTL warped features into a DNN framework. It presents three strategies: (i) VTL-independent KWS trains a single DNN on warped features, selecting a warping factor $\alpha$ at random for each epoch, and at test time combines the scores of all warped versions of the utterance with equal weight; (ii) VTL-independent$_{\alpha=1.00}$ KWS scores only the unwarped test features against the same DNN; and (iii) VTL-concatenation KWS concatenates all warped features into a single high-dimensional input. Evaluations on the Google Command dataset show that the VTL-independent methods yield consistent, statistically significant accuracy gains over the baselines, while the concatenation approach is less effective due to larger model size. The results indicate that incorporating VTL warping can make KWS more robust to speaker variability, suggesting potential for personalized KWS and broader deployment.

Abstract

In this paper, we propose several methods that incorporate vocal tract length (VTL) warped features for spoken keyword spotting (KWS). The first method, VTL-independent KWS, trains a single deep neural network (DNN) on VTL features with various warping factors. During training, a specific VTL feature is randomly selected per epoch, which exposes the network to VTL variations. During testing, the VTL features of a test utterance for the different warping factors are scored against the DNN and the scores are combined with equal weight. The second method, VTL-independent$_{\alpha=1.00}$ KWS, scores only the conventional features of a test utterance (without VTL warping) against the same DNN. The third method, VTL-concatenation KWS, concatenates the VTL warped features to form high-dimensional features for KWS. Evaluations carried out on the English Google Command dataset demonstrate that the proposed methods improve the accuracy of KWS.
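The three strategies can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the warping-factor grid `ALPHAS`, the feature shapes, the number of classes, and the stand-in scorer `toy_score_fn` are all hypothetical, and the features are assumed to be pre-computed once per warping factor.

```python
import numpy as np

# Hypothetical warping-factor grid; the grid actually used is not given here.
ALPHAS = [0.90, 1.00, 1.10]


def pick_alpha_for_epoch(rng, alphas=ALPHAS):
    """Training: draw one warping factor at random for the current epoch."""
    return alphas[rng.integers(len(alphas))]


def fuse_scores(score_fn, feats_by_alpha):
    """VTL-independent testing: score every warped version of the utterance
    against the same model and combine the scores with equal weight."""
    scores = np.stack([score_fn(f) for f in feats_by_alpha.values()])
    return scores.mean(axis=0)


def score_unwarped(score_fn, feats_by_alpha):
    """VTL-independent (alpha = 1.00) testing: score only the unwarped features."""
    return score_fn(feats_by_alpha[1.00])


def concat_features(feats_by_alpha):
    """VTL-concatenation: join the warped features along the feature axis."""
    ordered = [feats_by_alpha[a] for a in sorted(feats_by_alpha)]
    return np.concatenate(ordered, axis=-1)


def toy_score_fn(feats, n_classes=4):
    """Stand-in for a trained DNN (assumption): maps (frames, dims) features
    to a vector of class posteriors through a fixed random projection."""
    w = np.random.default_rng(0).standard_normal((feats.shape[1], n_classes))
    logits = feats.mean(axis=0) @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With a dictionary such as `{0.90: x_090, 1.00: x_100, 1.10: x_110}` of `(frames, dims)` arrays, `fuse_scores(toy_score_fn, feats)` returns one averaged posterior vector, while `concat_features(feats)` returns an array whose feature dimension is three times larger. That growth in input size is the cost of the concatenation variant.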

Paper Structure (11 sections, 5 equations, 2 figures, 3 tables)

Introduction
Proposed Methods
VTL warping factor
VTL-independent KWS
VTL-independent$_{\alpha=1.00}$ KWS
VTL-concatenation KWS
Classifiers
Experiment setup
Results and Discussions
Conclusion
Acknowledgement

Figures (2)

Figure 1: Comparison of the class-wise performance of BCResNet-8, VTL-independent$_{\alpha=1.00}$-BCResNet-8 and VTL-independent-BCResNet-8 based KWS methods on the Google Command dataset.
Figure 2: Accuracy of KWS for different VTL warped factors during testing.
