MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech
Junming Yuan, Ying Shi, Dong Wang, Lantian Li, Askar Hamdulla
TL;DR
This paper tackles few-shot keyword spotting (KWS) in mixed speech by integrating Mix-Training (MT) into a self-supervised learning framework, yielding MT-HuBERT. The method derives an acoustic unit codebook from clean speech and trains with per-frame $n$-hot targets under a multi-label BCE loss, so that the model disentangles and predicts the units active in a mixture rather than memorizing merged patterns. On GSC v2, MT-HuBERT pre-training combined with MT adaptation delivers state-of-the-art performance across clean, 2-mix, and unseen 3-mix conditions, with pronounced gains in low-shot and high-overlap settings. The work shows that self-supervised mix-training generalizes better to real-world overlapped speech and makes it possible to leverage unlabeled data for mixed-speech KWS, with potential extensions to larger or multilingual corpora and other speech tasks.
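The per-frame $n$-hot target and multi-label BCE loss mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `n_hot_targets` and `multilabel_bce`, the toy codebook size, and the assumption that each clean source already has one codebook index per frame are all hypothetical choices made for illustration.

```python
import math


def n_hot_targets(unit_ids_per_source, codebook_size):
    """Build per-frame n-hot targets from per-source clean unit ids.

    unit_ids_per_source: list of equal-length lists; entry [s][t] is the
    codebook index assumed to be assigned to frame t of clean source s.
    """
    num_frames = len(unit_ids_per_source[0])
    targets = [[0.0] * codebook_size for _ in range(num_frames)]
    for source_ids in unit_ids_per_source:
        for t, unit in enumerate(source_ids):
            # Mark every unit that is active in the mixture at frame t.
            targets[t][unit] = 1.0
    return targets


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over all frames and codebook entries."""
    total, count = 0.0, 0
    for frame_logits, frame_targets in zip(logits, targets):
        for z, y in zip(frame_logits, frame_targets):
            # Numerically stable form of BCE on a logit z with label y.
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Toy example (hypothetical values): two sources, three frames, four units.
source_a = [0, 1, 1]
source_b = [2, 1, 3]
targets = n_hot_targets([source_a, source_b], codebook_size=4)

# Uninformative predictions (all logits zero) versus confident, correct ones.
zero_logits = [[0.0] * 4 for _ in range(3)]
good_logits = [[10.0 if y == 1.0 else -10.0 for y in row] for row in targets]
loss_uninformed = multilabel_bce(zero_logits, targets)
loss_confident = multilabel_bce(good_logits, targets)
```

In this toy case the second frame has only one active unit because both sources share the same index there, so the number of ones per frame can be smaller than the number of mixed sources. Each codebook entry is treated as an independent binary decision, which is what lets several units be predicted as active in the same frame.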
Abstract
Few-shot keyword spotting (KWS) aims to detect previously unseen keywords from very few labeled samples. A pre-training and adaptation paradigm is typically adopted for this task. While effective in clean conditions, most existing approaches struggle with mixed keyword spotting, that is, detecting multiple overlapping keywords within a single utterance, a capability essential for real-world applications. We previously proposed a pre-training approach based on Mix-Training (MT) to tackle the mixed keyword detection problem and demonstrated its effectiveness. However, that approach is fully supervised and therefore cannot exploit vast amounts of unlabeled data. To address this limitation, we propose Mix-Training HuBERT (MT-HuBERT), a self-supervised learning (SSL) pre-training framework that implements the MT criterion during pre-training. Rather than predicting compositional patterns of the mixed speech, MT-HuBERT predicts, in a self-supervised manner, the clean acoustic units of each constituent signal from contextual cues. Experiments on the Google Speech Commands (GSC v2) corpus demonstrate that MT-HuBERT consistently outperforms several state-of-the-art baselines in few-shot KWS tasks under both mixed and clean conditions.
