LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting

Pai Zhu, Quan Wang, Dhruuv Agarwal, Kurt Partridge

TL;DR

This paper makes custom keyword spotting (KWS) more robust to confusable keywords through scalable data augmentation. An LLM generates vowel-based groups of confusable keywords, and a TTS engine synthesizes them in diverse speaking styles; the resulting data augments the training of a GE2E-based KWS model. A new vowel-group metric, c-AUC, measures performance on confusable keywords. The approach yields consistent gains on Speech Commands (lower EER and AUC, both measured on the DET curve) and a substantial improvement in confusable detection (c-AUC), with notable benefits on LibriPhrase-1s as well. The summary also highlights scalability, multilingual potential, and relevance to user-defined activation in real-world, open-vocabulary KWS scenarios.

Abstract

Custom keyword spotting (KWS) allows detecting user-defined spoken keywords from streaming audio. This is achieved by comparing the embeddings from voice enrollments and input audio. State-of-the-art custom KWS models are typically trained contrastively using utterances whose keywords are randomly sampled from the training dataset. These KWS models often struggle with confusable keywords, such as "blue" versus "glue". This paper introduces an effective way to augment the training with confusable utterances, where keywords are generated and grouped by large language models (LLMs) and speech signals are synthesized with diverse speaking styles by text-to-speech (TTS) engines. To better measure user experience on confusable KWS, we define a new north-star metric, c-AUC: the average area under the DET curve across confusable groups. Featuring high scalability and zero labor cost, the proposed method improves AUC by 3.7% and c-AUC by 11.3% on the Speech Commands test set.
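
The abstract's two core ideas, scoring by embedding comparison and averaging the DET-curve area over confusable groups, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cosine_score`, `det_auc`, `c_auc`), the use of an enrollment centroid with cosine similarity, and the choice of positive and negative trials within each group are all hypothetical. A lower area is better, since the DET curve plots false reject rate against false accept rate.

```python
import numpy as np


def cosine_score(enroll_embs, query_emb):
    """Cosine similarity between the enrollment centroid and a query embedding.

    Assumption: enrollments are averaged into one centroid before scoring.
    """
    centroid = np.mean(np.asarray(enroll_embs, dtype=float), axis=0)
    query = np.asarray(query_emb, dtype=float)
    return float(centroid @ query / (np.linalg.norm(centroid) * np.linalg.norm(query)))


def det_auc(pos_scores, neg_scores):
    """Area under the DET curve (FRR as a function of FAR); lower is better."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Sweep every observed score as a threshold, plus +/-inf so the curve
    # spans the full range from FAR = 1 down to FAR = 0.
    observed = np.unique(np.concatenate((pos, neg)))
    thresholds = np.concatenate(([-np.inf], observed, [np.inf]))
    # A trial is accepted when its score is >= the threshold.
    far = np.array([(neg >= t).mean() for t in thresholds])
    frr = np.array([(pos < t).mean() for t in thresholds])
    # Reverse so FAR is ascending, then integrate with the trapezoid rule.
    far, frr = far[::-1], frr[::-1]
    return float(np.sum(np.diff(far) * (frr[1:] + frr[:-1]) / 2.0))


def c_auc(groups):
    """Average DET AUC over confusable groups.

    Assumption: `groups` maps a group name to (pos_scores, neg_scores), where
    negatives are trials against other keywords in the same confusable group.
    """
    return float(np.mean([det_auc(pos, neg) for pos, neg in groups.values()]))
```

In this sketch, a group whose positive and negative scores are perfectly separated contributes an area of 0, and a group whose scores are fully inverted contributes 1, so the averaged value reflects how often keywords within a confusable group are mistaken for one another.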

Paper Structure

This paper contains 19 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An overview of the LLM-Synth4KWS framework: (a) the LLM generates confusable keywords, styles are sampled for TTS, and the synthesized speech augments the training data; (b) vowel-group-based evaluation and a new north-star metric.
  • Figure 2: Vowels in the English language.
  • Figure 3: Vowel-grouped words generated by the LLM (truncated).
  • Figure 4: DET curves for baseline and augmented model on Speech Commands.