Few-Shot Keyword Spotting from Mixed Speech
Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla
TL;DR
This work tackles few-shot keyword spotting (KWS) in mixed speech, where multiple keywords may co-occur in one utterance, by applying Mix-Training (MT) in pre-training, fine-tuning, or both. MT uses $k$-hot labels and energy-unbiased mixing to reflect the superposition of speech signals. It is compared with conventional clean training and Mixup, and with the large self-supervised learning (SSL) baselines HuBERT and wav2vec 2.0. Experiments on LibriSpeech_960 and Google Speech Commands v2 show that MT substantially improves performance in mixed-speech conditions, and that HuBERT with MT fine-tuning gives strong results across all test conditions. The findings suggest that MT is a key technique for robust few-shot KWS in realistic multi-speaker scenarios and that combining it with large SSL models yields the best practical performance. Future work includes scaling MT pre-training and integrating MT with self-supervised learning.
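To make the idea of $k$-hot labels and energy-unbiased mixing concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that "energy-unbiased" can be modelled by scaling every source to unit RMS before summing, so that no keyword dominates the mixture. The function names (`rms`, `mix_training_example`, `mixup_target`) and the `eps` parameter are hypothetical. A Mixup-style soft target is included only for contrast: its weights sum to 1, whereas a $k$-hot target marks every keyword that is present with 1.

```python
import math


def rms(signal):
    """Root-mean-square level of a waveform given as a list of floats."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_training_example(waves, labels, num_classes, eps=1e-8):
    """Sum equal-length waveforms and build a k-hot target.

    Assumption (not taken from the paper): each source is scaled to
    unit RMS before summation to approximate energy-unbiased mixing.
    """
    length = len(waves[0])
    mixed = [0.0] * length
    for wave in waves:
        scale = 1.0 / (rms(wave) + eps)  # eps guards against silent input
        for i, x in enumerate(wave):
            mixed[i] += scale * x
    # k-hot target: 1.0 for every keyword present in the mixture.
    target = [0.0] * num_classes
    for label in labels:
        target[label] = 1.0
    return mixed, target


def mixup_target(label_a, label_b, lam, num_classes):
    """Mixup-style soft target for contrast: the weights sum to 1."""
    target = [0.0] * num_classes
    target[label_a] += lam
    target[label_b] += 1.0 - lam
    return target
```

Calling `mix_training_example([wave_a, wave_b], labels=[2, 5], num_classes=8)` on two equal-length waveforms returns the summed signal together with a target that has ones at indices 2 and 5. A classifier trained on such targets would typically use a per-class sigmoid output rather than a softmax, since several classes can be active at once.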
Abstract
Few-shot keyword spotting (KWS) aims to detect previously unseen keywords from only a few training samples. A commonly used approach is the pre-training and fine-tuning framework. While effective in clean conditions, this approach struggles with mixed keyword spotting, that is, simultaneously detecting multiple keywords blended in a single utterance, a capability that is crucial in real-world applications. Previous research proposed a Mix-Training (MT) approach to address this problem, but it has never been tested in the few-shot scenario. In this paper, we investigate whether MT and related methods can address the two practical challenges together: few-shot learning and mixed speech. Experiments on the LibriSpeech and Google Speech Commands corpora demonstrate that MT is highly effective on this task when employed in either the pre-training phase or the fine-tuning phase. Moreover, combining large-scale SSL-based pre-training (HuBERT) with MT fine-tuning yields very strong results in all test conditions.
