GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples

Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang

TL;DR

This work tackles the challenge of keyword spotting (KWS) models misclassifying confusable phrases by introducing GraphemeAug, a systematic method to generate confusables via grapheme edits and large-scale TTS data. The approach leverages AudioLM with style transfer to produce high-quality synthetic training data and a two-stage streaming model to handle real-time inference. Key findings show that style-transfer TTS yields notable quality gains, and synthetic confusables substantially improve robustness to near-neighbor phrases without harming true positives; increasing confusable diversity further enhances generalization. The results imply a practical pathway to robust KWS in dynamic language use, where confusables arise from evolving pronunciation and phrasing in real-world settings.

Abstract

Spoken Keyword Spotting (KWS) is the task of distinguishing between the presence and absence of a keyword in audio. The accuracy of a KWS model hinges on its ability to correctly classify examples close to the keyword and non-keyword boundary. These boundary examples are often scarce in training data, limiting model performance. In this paper, we propose a method to systematically generate adversarial examples close to the decision boundary by making insertion/deletion/substitution edits on the keyword's graphemes. We evaluate this technique on held-out data for a popular keyword and show that the technique improves AUC on a dataset of synthetic hard negatives by 61% while maintaining quality on positives and ambient negative audio data.
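
The grapheme-edit idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `single_edits` and `confusables`, the lowercase ASCII alphabet, and the example keyword `"hello"` are all assumptions made for illustration. The sketch only enumerates candidate text strings within a given number of insertion, deletion, or substitution edits; synthesizing audio for them with TTS is out of scope here.

```python
import string

# Assumption: graphemes are modeled as lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def single_edits(word, alphabet=ALPHABET):
    """Return every string exactly one insertion, deletion, or substitution
    away from `word`, excluding `word` itself and the empty string."""
    results = set()
    for i in range(len(word) + 1):
        # Insertion: place each grapheme before position i.
        for ch in alphabet:
            results.add(word[:i] + ch + word[i:])
        if i < len(word):
            # Deletion: drop the grapheme at position i.
            results.add(word[:i] + word[i + 1:])
            # Substitution: swap the grapheme at position i for another one.
            for ch in alphabet:
                if ch != word[i]:
                    results.add(word[:i] + ch + word[i + 1:])
    results.discard(word)
    results.discard("")
    return results


def confusables(keyword, max_edits=1, alphabet=ALPHABET):
    """Return strings reachable from `keyword` in at most `max_edits`
    single-grapheme edits, excluding the keyword itself."""
    seen = {keyword}
    frontier = {keyword}
    for _ in range(max_edits):
        candidates = set()
        for w in frontier:
            candidates |= single_edits(w, alphabet)
        # Keep only strings not already reached with fewer edits.
        frontier = candidates - seen
        seen |= frontier
    seen.discard(keyword)
    return seen


if __name__ == "__main__":
    # "hello" is a hypothetical keyword chosen only for this example.
    near = confusables("hello", max_edits=1)
    print(len(near), sorted(near)[:5])
```

The number of candidates grows quickly with edit distance, so in practice one would likely sample from the set rather than enumerate it. That choice is a design assumption of this sketch, not something stated in the abstract.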

Paper Structure

This paper contains 17 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The data generation process to create positive, negative, and confusable synthetic datasets. This paper proposes incorporating the GraphemeAug algorithm (lower left corner), which is able to systematically generate a set of confusables.
  • Figure 2: ROC curves. (a) shows that TTS with style transfer generally outperforms standard TTS. (b) shows how adding confusables improves performance when comparing real positive examples to negatives synthesized for the same confusables, but spoken by other users. (c) shows that the model trained with confusables maintains similar quality on real positive and real non-confusable negative data.
  • Figure 3: eval-real-pos vs eval-ed3 AUC when training with different numbers of unique confusables with edit distance 3.
  • Figure 4: eval-real-pos vs eval-ed3 AUC when training with 10,000 unique confusables from different edit distances.