No Word Left Behind: Mitigating Prefix Bias in Open-Vocabulary Keyword Spotting
Yi Liu, Chuan-Che Jeff Huang, Xiao Quan
TL;DR
This work tackles prefix bias in open-vocabulary keyword spotting by introducing the Partial Overlap Benchmark (POB), which stress-tests prefix-sharing cases, and the Equal-weighting Position Scoring (EPS) module, which attenuates position-biased scoring. EPS reduces early-position emphasis, improving robustness to partially overlapping enrollments, while POB provides a realistic evaluation regime and training data for cross-domain generalization. Empirical results show that EPS alone substantially lowers EER on POB-Spark (from 64.4% to 29.3%) and raises accuracy on longer-prefix sets, and that adding POB data in training yields further gains at the cost of some performance on standard benchmarks. The combination of EPS and POB achieves strong cross-domain performance but reveals a trade-off for short commands, motivating future work on data balance and on weighting strategies that better handle diverse phrase lengths.
Abstract
Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recent work has explored audio-text joint embeddings, which let users enroll phrases as text, and has proposed techniques to disambiguate similar utterances. We find that existing OV-KWS solutions often over-weight the beginning phonemes of an enrollment, causing false triggers when a negative enrollment-query pair shares a prefix (e.g., "turn the volume up" vs. "turn the volume down"). We trace this to two factors: training data bias and position-biased cross-modal scoring. To address these limitations, we introduce the Partial Overlap Benchmark (POB), comprising two datasets, POB-Spark and POB-LibriPhrase (POB-LP), that contain mismatched audio-text pairs with shared prefixes, and we propose Equal-weighting Position Scoring (EPS), a lightweight decision layer. Using EPS alone reduces the equal error rate (EER) on POB-Spark from 64.4% to 29.3% and improves POB-LP accuracy from 87.6% to 96.8%, while maintaining performance on LibriPhrase and Google Speech Commands (GSC). With POB data added in training, our approach achieves the best POB benchmark results while incurring the least degradation on prior metrics among baselines. This degradation is most pronounced on GSC, which contains only one-word commands. We leave mitigating this trade-off to future work.
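To make the prefix-bias failure mode concrete, the toy sketch below compares a front-loaded aggregation of per-position match scores with an equal-weighted one. It is an illustration only, not the EPS layer itself: the per-word scores, the geometric decay, the 0.8 threshold, and the function names are all assumptions chosen for this example.

```python
def aggregate(scores, weights):
    """Weighted average of per-position match scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def front_loaded_weights(n, decay=0.5):
    """Hypothetical position-biased weights: each position counts
    `decay` times as much as the one before it."""
    return [decay ** i for i in range(n)]


def equal_weights(n):
    """Every position contributes equally."""
    return [1.0] * n


# Assumed per-word match scores for the enrollment "turn the volume up"
# against query audio saying "turn the volume down": the first three
# words match, the last one does not.
scores = [1.0, 1.0, 1.0, 0.0]
threshold = 0.8  # assumed decision threshold

biased = aggregate(scores, front_loaded_weights(len(scores)))
equal = aggregate(scores, equal_weights(len(scores)))

print(f"front-loaded: {biased:.3f} -> trigger={biased > threshold}")
print(f"equal:        {equal:.3f} -> trigger={equal > threshold}")
```

Under these assumed numbers, the front-loaded score stays above the threshold even though the final word differs, whereas the equal-weighted score falls below it. The same arithmetic hints at why one-word commands behave differently: with a single position, the two weightings coincide.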
