GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples
Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang
TL;DR
This work tackles the challenge of keyword spotting (KWS) models misclassifying confusable phrases by introducing GraphemeAug, a systematic method to generate confusables via grapheme edits and large-scale TTS data. The approach leverages AudioLM with style transfer to produce high-quality synthetic training data and a two-stage streaming model to handle real-time inference. Key findings show that style-transfer TTS yields notable quality gains, and synthetic confusables substantially improve robustness to near-neighbor phrases without harming true positives; increasing confusable diversity further enhances generalization. The results imply a practical pathway to robust KWS in dynamic language use, where confusables arise from evolving pronunciation and phrasing in real-world settings.
Abstract
Spoken Keyword Spotting (KWS) is the task of distinguishing between the presence and absence of a keyword in audio. The accuracy of a KWS model hinges on its ability to correctly classify examples close to the boundary between keyword and non-keyword. These boundary examples are often scarce in training data, which limits model performance. In this paper, we propose a method to systematically generate adversarial examples close to the decision boundary by making insertion, deletion, and substitution edits on the keyword's graphemes. We evaluate this technique on held-out data for a popular keyword and show that it improves AUC on a dataset of synthetic hard negatives by 61% while maintaining quality on positives and ambient negative audio data.
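To make the grapheme-edit idea concrete, the following is a minimal Python sketch that enumerates every string one insertion, deletion, or substitution away from a keyword. It is an illustration under stated assumptions, not the authors' implementation: each character is treated as one grapheme, the alphabet is lowercase ASCII, and the function name `grapheme_edits` and the keyword `"hello"` are hypothetical. The text-to-speech step that would turn these strings into audio is not shown.

```python
import string
from typing import Set


def grapheme_edits(keyword: str, alphabet: str = string.ascii_lowercase) -> Set[str]:
    """Return all strings exactly one grapheme edit away from `keyword`.

    Assumption: one character == one grapheme. A real pipeline may use a
    different grapheme inventory or filter out unpronounceable variants.
    """
    variants = set()
    for i in range(len(keyword) + 1):
        head, tail = keyword[:i], keyword[i:]
        # Insertion: add one grapheme at position i.
        for g in alphabet:
            variants.add(head + g + tail)
        if tail:
            # Deletion: drop the grapheme at position i.
            variants.add(head + tail[1:])
            # Substitution: replace the grapheme at position i.
            for g in alphabet:
                variants.add(head + g + tail[1:])
    # The keyword itself is a positive, not a hard negative.
    variants.discard(keyword)
    # An empty string cannot be synthesized as speech.
    variants.discard("")
    return variants


if __name__ == "__main__":
    candidates = grapheme_edits("hello")  # hypothetical keyword
    print(len(candidates), sorted(candidates)[:5])
```

Each resulting string is a candidate confusable text that could be passed to a speech synthesizer and labeled as a negative. Applying the function again to its own outputs would yield candidates at larger edit distances, at the cost of a rapidly growing candidate set.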
