Open vocabulary keyword spotting through transfer learning from speech synthesis
Kesavaraj V, Anil Kumar Vuppala
TL;DR
This work tackles open-vocabulary keyword spotting by bridging audio and text representations through transfer learning from a pre-trained text-to-speech system. By embedding Tacotron 2 intermediate representations into the text encoder and aligning them with audio features via a cross-attention pattern extractor, the approach improves discrimination between keywords with similar pronunciations. An ablation study shows that the E3 Tacotron 2 layer (Bi-LSTM output) provides the best balance of performance and training efficiency. The method is robust to word length and to out-of-vocabulary scenarios across multiple datasets, with substantial gains on LibriPhrase Hard. The proposed framework offers a practical path toward personalized, robust keyword spotting on edge devices, enabling more flexible voice interfaces without requiring explicit shared embedding spaces tuned for each new keyword.
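To make the alignment step concrete, here is a minimal sketch of scaled dot-product cross-attention between a text sequence and an audio sequence. It is an illustration under stated assumptions, not the authors' implementation: the choice of text embeddings as queries and audio features as keys and values, the shapes, and the name `cross_attention` are all hypothetical, and the learned projection layers a real pattern extractor would use are omitted.

```python
import numpy as np

def cross_attention(text_emb, audio_emb):
    """Each text frame attends over all audio frames (assumed query/key roles)."""
    d = text_emb.shape[-1]
    # Similarity between every text frame and every audio frame.
    scores = text_emb @ audio_emb.T / np.sqrt(d)      # (T_text, T_audio)
    # Numerically stable softmax over the audio axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Audio-aware context vector for each text frame.
    return weights @ audio_emb, weights

# Hypothetical shapes: 5 text frames, 40 audio frames, 16-dim embeddings.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((5, 16))
audio_emb = rng.standard_normal((40, 16))
context, weights = cross_attention(text_emb, audio_emb)
```

Each row of `weights` shows how strongly one text frame matches each audio frame. A downstream classifier could use such an alignment pattern to decide whether the spoken audio corresponds to the enrolled keyword text.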
Abstract
Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open-vocabulary keyword spotting depend on a shared embedding space created by audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages knowledge acquired from a pre-trained text-to-speech (TTS) system. This knowledge transfer makes the text representations derived from the text encoder aware of audio projections. The performance of the proposed approach is compared with various baseline methods across four different datasets. The robustness of the proposed model is evaluated by assessing its performance across different word lengths and in an out-of-vocabulary (OOV) scenario. Additionally, the effectiveness of transfer learning from the TTS system is investigated by analyzing its different intermediate representations. The experimental results indicate that, on the challenging LibriPhrase Hard dataset, the proposed approach outperforms the cross-modality correspondence detector (CMCD) method, improving area under the curve (AUC) by 8.22% and equal error rate (EER) by 12.56%.
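For readers unfamiliar with the two reported metrics, the following sketch shows one common way to compute AUC and EER from match scores and binary labels. It is a generic illustration, not the evaluation code used in this work: the function names and the toy scores are hypothetical, and the EER is approximated by the threshold at which the false-acceptance and false-rejection rates are closest.

```python
import numpy as np

def auc_score(scores, labels):
    """AUC as the probability that a positive pair outscores a negative pair."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a correct ranking.
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def equal_error_rate(scores, labels):
    """Approximate EER: error rate where false accepts and false rejects balance."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    best_gap, eer = np.inf, 1.0
    for thr in np.unique(scores):
        far = np.mean(neg >= thr)   # negatives wrongly accepted
        frr = np.mean(pos < thr)    # positives wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)

# Toy example: 1 = audio matches the keyword text, 0 = mismatch.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_auc = auc_score(toy_scores, toy_labels)
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

Higher AUC and lower EER both indicate better separation between matching and non-matching audio-text pairs.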
