Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units
Bolaji Yusuf, Jan "Honza" Černocký, Murat Saraçlar
TL;DR
This work addresses the performance gap between end-to-end (E2E) keyword search (KWS) and ASR-based approaches by pretraining on untranscribed speech through acoustic unit discovery (AUD). The authors use Hierarchical Subspace Hidden Markov Models (H-SHMM) to produce discrete acoustic units, form pseudo-words from unit sequences, and pretrain an E2E KWS model before finetuning it on a small transcribed corpus. For finetuning, they transfer the document encoder and switch to a grapheme-based query encoder. Experiments on English Libri-light and Turkish BNTR show that AUD-based pretraining yields substantial ATWV gains when AUD uses MFCC inputs and even larger gains when it uses XLS-R-derived features; the gains correlate with AUD quality as measured by normalized mutual information (NMI). These results point to a practical way of using large amounts of unlabeled speech to improve E2E KWS, with potential for scalable multilingual pretraining and better unit-to-word segmentation strategies.
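As a rough illustration of the AUD quality measure mentioned above, the following sketch computes NMI between frame-level discovered units and reference phone labels. The function name, the frame-level alignment of the two sequences, and the normalization 2·I(U;P) / (H(U) + H(P)) are assumptions made for this example; the paper's exact definition may differ.

```python
import math
from collections import Counter


def normalized_mutual_information(units, phones):
    """NMI between two equal-length frame-level label sequences.

    Assumed normalization: 2 * I(U;P) / (H(U) + H(P)).
    Returns 0.0 when both sequences are constant (zero total entropy).
    """
    if len(units) != len(phones):
        raise ValueError("label sequences must have the same length")
    n = len(units)
    joint = Counter(zip(units, phones))
    u_counts = Counter(units)
    p_counts = Counter(phones)

    def entropy(counts):
        # Empirical entropy in nats.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(U;P) = sum p(u,p) * log( p(u,p) / (p(u) p(p)) )
    mi = sum(
        (c / n) * math.log(c * n / (u_counts[u] * p_counts[p]))
        for (u, p), c in joint.items()
    )
    denom = entropy(u_counts) + entropy(p_counts)
    return 2.0 * mi / denom if denom > 0 else 0.0
```

Under this normalization, units that are a one-to-one relabeling of the phones score 1, and units that are statistically independent of the phones score 0.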
Abstract
End-to-end (E2E) keyword search (KWS) has emerged as an alternative and complementary approach to conventional keyword search, which depends on the output of automatic speech recognition (ASR) systems. While E2E methods greatly simplify the KWS pipeline, they generally perform worse than their ASR-based counterparts, which can benefit from pretraining with untranscribed data. In this work, we propose a method for pretraining E2E KWS systems with untranscribed data, which involves using acoustic unit discovery (AUD) to obtain discrete units for untranscribed data and then learning to locate sequences of such units in the speech. We conduct experiments across languages and AUD systems: we show that finetuning such a model significantly outperforms a model trained from scratch, and the performance improvements are generally correlated with the quality of the AUD system used for pretraining.
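The pretraining idea, locating sequences of discovered units in speech, can be sketched as a data-preparation step. The example below is a minimal sketch and not the authors' procedure: it assumes the AUD system outputs one unit label per frame, assumes a 10 ms frame shift, and samples a contiguous run of units as a pseudo-query together with its time span. The helper names `collapse_frames` and `sample_pseudo_query` and the length bounds are hypothetical.

```python
import random
from itertools import groupby


def collapse_frames(frame_units, frame_shift=0.01):
    """Merge runs of identical frame labels into (unit, start_sec, end_sec)."""
    segments = []
    start = 0
    for unit, run in groupby(frame_units):
        length = sum(1 for _ in run)
        segments.append(
            (unit, start * frame_shift, (start + length) * frame_shift)
        )
        start += length
    return segments


def sample_pseudo_query(segments, min_len=3, max_len=6, rng=random):
    """Pick a contiguous unit n-gram as a pseudo-word.

    Returns (unit_tuple, start_sec, end_sec); the time span can serve as
    the location target when training the search model.
    """
    if len(segments) < min_len:
        raise ValueError("utterance has too few unit segments")
    n = rng.randint(min_len, min(max_len, len(segments)))
    i = rng.randrange(len(segments) - n + 1)
    chunk = segments[i:i + n]
    query = tuple(unit for unit, _, _ in chunk)
    return query, chunk[0][1], chunk[-1][2]


# Usage: frame-level unit labels for one (toy) untranscribed utterance.
segments = collapse_frames([3, 3, 3, 7, 7, 1, 1, 1, 1, 4])
query, start_sec, end_sec = sample_pseudo_query(segments, rng=random.Random(0))
```

Each sampled pair of unit sequence and time span plays the role that a written keyword and its occurrence play in supervised KWS training, which is why no transcripts are needed at this stage.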
