
Continuously Learning New Words in Automatic Speech Recognition

Christian Huber, Alexander Waibel

TL;DR

This work addresses the recognition of acronyms, named entities, and domain-specific words in ASR with a self-supervised continual learning pipeline: decoding is biased toward new words taken from lecture slides using a memory-augmented ASR model, and the model is then adapted through a low-rank factorization of its weights. The procedure iterates over many talks: extract new words from the slides, generate pseudo-labels for utterances containing them, and update the model by learning a compact low-rank correction $\overline{W} = \sum_{i=1}^k r_i s_i^T$ of rank at most $k$. Empirical results show that new-word recall can exceed $80\%$ as words occur more frequently, while the overall WER remains close to the baseline, and the method scales efficiently to as many as 66 talks. This matters for deploying ASR in domains with scarce labeled data for specialized vocabulary, where it enables more robust recognition of content-rich lectures and meetings.
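To make the low-rank correction concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It builds $\overline{W}$ as a sum of $k$ outer products $r_i s_i^T$ and compares the number of stored values with that of a full matrix. The function names (`low_rank_correction`, `param_counts`), the toy vectors, and the 1024 x 1024 layer size are assumptions chosen for illustration; how the correction is combined with the pretrained weights is not specified in this summary and is left out.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def low_rank_correction(rs: List[List[float]], ss: List[List[float]]) -> Matrix:
    """Return sum_i outer(r_i, s_i) for k pairs of vectors (rank at most k)."""
    if len(rs) != len(ss) or not rs:
        raise ValueError("need k >= 1 matching (r_i, s_i) pairs")
    d_out, d_in = len(rs[0]), len(ss[0])
    w_bar = [[0.0] * d_in for _ in range(d_out)]
    for r, s in zip(rs, ss):
        # Add the rank-1 term r s^T element by element.
        for a in range(d_out):
            for b in range(d_in):
                w_bar[a][b] += r[a] * s[b]
    return w_bar


def param_counts(d_out: int, d_in: int, k: int) -> Tuple[int, int]:
    """Values stored by a full d_out x d_in matrix vs. by k rank-1 factor pairs."""
    return d_out * d_in, k * (d_out + d_in)


# Toy example: k = 2 rank-1 terms for a hypothetical 3 x 2 weight matrix.
rs = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
ss = [[1.0, 1.0], [0.5, -0.5]]
w_bar = low_rank_correction(rs, ss)

# Assumed layer size, for illustration only.
full, factored = param_counts(d_out=1024, d_in=1024, k=4)
```

With the assumed sizes, the four factor pairs hold 8,192 values against 1,048,576 for a full matrix, which is why such a correction stays compact.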

Abstract

Despite recent advances, Automatic Speech Recognition (ASR) systems are still far from perfect. Typical errors include acronyms, named entities, and domain-specific special words for which little or no labeled data is available. To address the problem of recognizing these words, we propose a self-supervised continual learning approach: Given the audio of a lecture talk with the corresponding slides, we bias the model towards decoding new words from the slides by using a memory-enhanced ASR model from the literature. Then, we perform inference on the talk, collecting utterances that contain detected new words into an adaptation data set. Continual learning is then performed by training adaptation weights added to the model on this data set. The whole procedure is iterated for many talks. We show that with this approach, we obtain increasing performance on the new words when they occur more frequently (more than 80% recall) while preserving the general performance of the model.
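The learning cycle described in the abstract can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the authors' code: `decode` and `adapt` are hypothetical placeholders for biased inference with the memory-enhanced model and for training the adaptation weights, new-word detection is reduced to a whitespace token match, and accumulating the adaptation set across talks is an assumption.

```python
from typing import Callable, Iterable, List, Set, Tuple

Utterance = Tuple[str, str]  # (utterance id, decoded transcript used as pseudo-label)


def select_pseudo_labels(hypotheses: Iterable[Utterance],
                         new_words: Set[str]) -> List[Utterance]:
    """Keep utterances whose hypothesis contains at least one new word."""
    wanted = {w.lower() for w in new_words}
    return [(uid, text) for uid, text in hypotheses
            if wanted & set(text.lower().split())]


def continual_learning(talks, decode: Callable, adapt: Callable) -> List[Utterance]:
    """Run one learning cycle per talk; each talk is (audio, slide_words)."""
    adaptation_set: List[Utterance] = []
    for audio, slide_words in talks:
        # Placeholder: inference biased toward the slide words.
        hypotheses = decode(audio, slide_words)
        adaptation_set.extend(select_pseudo_labels(hypotheses, slide_words))
        # Placeholder: train the adaptation weights on the collected data.
        adapt(adaptation_set)
    return adaptation_set


# Toy run: the "audio" already holds transcripts, and "Florbix" and "QRZT"
# are invented stand-ins for new words found on slides.
def fake_decode(audio, slide_words):
    return audio


talks = [
    ([("u1", "we use Florbix adapters"), ("u2", "thank you all")], {"Florbix"}),
    ([("u3", "the QRZT encoder"), ("u4", "next slide please")], {"QRZT"}),
]
sizes: List[int] = []
data = continual_learning(talks, fake_decode, lambda d: sizes.append(len(d)))
```

In this toy run only `u1` and `u3` enter the adaptation set, since they are the only hypotheses containing a slide word.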

Paper Structure

This paper contains 9 sections, 2 equations, and 3 figures.

Figures (3)

  • Figure 1: Illustration of the continual learning: In each learning cycle the model is biased towards the new words from the slides of the current talk, inference is performed and pseudo-labels containing the new words are collected; then the model is adapted.
  • Figure 2: Results of the factorization experiment: Left: Number of parameters (MB, 16-bit) vs. F1-score after training with the new-words data for a factorized decoder and a factorized encoder+decoder. The baseline model has an F1-score of $0.402$. Right: Number of training samples per new word vs. F1-score for the different categories with a factorized encoder+decoder and $k=4$.
  • Figure 3: Results of the continual learning experiment: Left: General performance: Accumulated number of new words from slides vs. WER (in %) on the Tedlium test set. Middle and right: Forward transfer: Number of training samples per new word vs. forward transfer recall and precision.