Continuously Learning New Words in Automatic Speech Recognition
Christian Huber, Alexander Waibel
TL;DR
This work addresses the recognition of acronyms, named entities, and domain-specific words in ASR with a self-supervised continual learning pipeline. A memory-augmented ASR model biases decoding toward new words taken from the slides, and the model is then adapted through a low-rank factorization of the weight correction. The approach iterates across many talks: extract new words from slides, generate pseudo-labels for utterances containing them, and update the model by learning a compact low-rank correction $\overline{W} = \sum_{i=1}^k r_i s_i^T$. Empirical results show that new-word recall can exceed $80\%$ as words occur more frequently, while the overall WER remains close to the baseline, and the method scales efficiently across up to 66 talks. This matters in practice for deploying ASR in domains where labeled data for specialized vocabulary is scarce, such as content-rich lectures and meetings.
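To make the low-rank correction $\overline{W} = \sum_{i=1}^k r_i s_i^T$ concrete, the following minimal sketch builds it as a sum of $k$ outer products and applies it on top of an unchanged base weight matrix. The matrix sizes, the numeric values, and the function names are illustrative assumptions, not the authors' implementation; which layers are adapted and how the vectors $r_i$, $s_i$ are trained is not covered here.

```python
def outer(r, s):
    """Outer product r s^T of two vectors, returned as a list of rows."""
    return [[ri * sj for sj in s] for ri in r]

def low_rank_correction(rs, ss):
    """Sum of k outer products: W_bar = sum_i r_i s_i^T."""
    rows, cols = len(rs[0]), len(ss[0])
    w_bar = [[0.0] * cols for _ in range(rows)]
    for r, s in zip(rs, ss):
        for a, row in enumerate(outer(r, s)):
            for b, value in enumerate(row):
                w_bar[a][b] += value
    return w_bar

def apply_adapted(w, w_bar, x):
    """Compute (W + W_bar) x; the base weights W are not modified."""
    return [
        sum((w[a][b] + w_bar[a][b]) * x[b] for b in range(len(x)))
        for a in range(len(w))
    ]

# Hypothetical 3x2 base weight with a rank-1 correction (k = 1).
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
rs = [[1.0, 2.0, 3.0]]   # r_1, length 3 (assumed values)
ss = [[0.5, -1.0]]       # s_1, length 2 (assumed values)

W_bar = low_rank_correction(rs, ss)
y = apply_adapted(W, W_bar, [2.0, 1.0])
```

For an $m \times n$ weight matrix, the correction stores $k(m+n)$ numbers instead of $mn$, which is what makes it compact when $k$ is small.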
Abstract
Despite recent advances, Automatic Speech Recognition (ASR) systems are still far from perfect. Typical errors occur on acronyms, named entities, and domain-specific special words for which little or no labeled data is available. To address the problem of recognizing these words, we propose a self-supervised continual learning approach: Given the audio of a lecture talk with the corresponding slides, we bias the model towards decoding new words from the slides by using a memory-enhanced ASR model from the literature. Then, we perform inference on the talk, collecting utterances that contain detected new words into an adaptation data set. Continual learning is then performed by training adaptation weights added to the model on this data set. The whole procedure is iterated for many talks. We show that with this approach, we obtain increasing performance on the new words when they occur more frequently (more than 80% recall) while preserving the general performance of the model.
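The iteration described above can be sketched as a short loop. This is only an outline under stated assumptions: `decode` and `train_adaptation` are hypothetical callables standing in for the memory-enhanced ASR model and the adaptation-weight training, the dictionary keys are invented for illustration, and accumulating the adaptation set across talks is an assumption rather than something the abstract specifies.

```python
def collect_adaptation_data(utterances, new_words, decode):
    """Keep (audio, hypothesis) pairs whose hypothesis contains a new word."""
    selected = []
    for audio in utterances:
        # Decoding is biased toward the slide words (hypothetical interface).
        hypothesis = decode(audio, bias_words=new_words)
        if any(word in hypothesis.split() for word in new_words):
            # The hypothesis serves as the pseudo-label for this utterance.
            selected.append((audio, hypothesis))
    return selected

def continual_learning(talks, decode, train_adaptation):
    """Process talks one by one, extending the adaptation set and retraining."""
    adaptation_set = []
    for talk in talks:
        adaptation_set += collect_adaptation_data(
            talk["utterances"], talk["slide_words"], decode
        )
        train_adaptation(adaptation_set)
    return adaptation_set

# Toy stand-ins: the "audio" is already text, and "training" only records
# the size of the adaptation set at each step.
def toy_decode(audio, bias_words):
    return audio

talks = [
    {"slide_words": ["NATO"], "utterances": ["the NATO summit", "hello everyone"]},
    {"slide_words": ["UNESCO"], "utterances": ["UNESCO was founded"]},
]
sizes = []
data = continual_learning(talks, toy_decode, lambda d: sizes.append(len(d)))
```

In the toy run, only utterances containing a slide word are kept, so the adaptation set grows by one utterance per talk.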
