A Multitask Training Approach to Enhance Whisper with Contextual Biasing and Open-Vocabulary Keyword Spotting

Yuang Li, Min Zhang, Chang Su, Yinglu Li, Xiaosong Qiao, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Shimin Tao, Hao Yang

TL;DR

KWS-Whisper adds an open-vocabulary keyword spotting (OV-KWS) module that operates on Whisper encoder hidden states and biases decoding toward user-defined entities. Jointly training OV-KWS with a contextual-ASR task in a multitask framework improves entity recall and reduces MER on Mandarin and code-switching data. The OV-KWS module can also act as a plug-in for ASR error correction and for prompting frozen Whisper models, so the benefits do not require full fine-tuning. On the Aishell hot word subsets and internal code-switching test sets, the reported gains in entity recall reach up to ~80%, with recall approaching or exceeding 95% in some setups, alongside notable MER reductions. The work points to practical ways of combining open-vocabulary keyword search with large ASR models to recognize named entities more reliably across languages.

Abstract

Recognizing rare named entities, such as personal names and specialized terminology, is challenging for automatic speech recognition (ASR) systems, especially when these entities appear infrequently in the training data. In this paper, we introduce keyword spotting enhanced Whisper (KWS-Whisper), a novel ASR system that builds on the Whisper model and performs open-vocabulary keyword spotting (OV-KWS) on the hidden states of the Whisper encoder to recognize user-defined named entities. The detected entities serve as prompts for the Whisper decoder. To optimize the model, we propose a multitask training approach that jointly learns the OV-KWS and contextual-ASR tasks. We evaluate our approach on the Chinese Aishell hot word subsets and two internal code-switching test sets and show that it significantly improves entity recall compared to the original Whisper model. Moreover, we demonstrate that the OV-KWS module can serve as a plug-and-play component that enhances ASR error correction methods and frozen Whisper models.
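
The flow described in the abstract (spot user-defined entities on encoder hidden states, then pass the detected entities to the decoder as a prompt) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the paper's OV-KWS module is a trained component, whereas this sketch substitutes a plain sliding-window cosine-similarity match; the function names (`spot_keywords`, `build_prompt`), the threshold, the entity names, and the 3-dimensional "hidden states" are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_pool(frames):
    """Average a list of frame vectors into a single vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def spot_keywords(utt_frames, entity_feats, threshold=0.9):
    """Return entities whose pooled feature matches some window of the utterance.

    Stand-in for a learned OV-KWS module: each entity feature (a list of
    frames) is mean-pooled and compared against every same-length window
    of the utterance frames.
    """
    detected = []
    for name, feat_frames in entity_feats.items():
        win = len(feat_frames)
        if win == 0 or win > len(utt_frames):
            continue
        target = mean_pool(feat_frames)
        best = max(
            cosine(mean_pool(utt_frames[i:i + win]), target)
            for i in range(len(utt_frames) - win + 1)
        )
        if best >= threshold:
            detected.append(name)
    return detected


def build_prompt(detected):
    """Join detected entities into a text prompt for the decoder (hypothetical format)."""
    return ", ".join(detected)


# Toy data: 4 utterance frames and 2 candidate entities, all made up.
utt_frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.1], [0.0, 0.0, 1.0]]
entity_feats = {
    "entity_a": [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    "entity_b": [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
}
detected = spot_keywords(utt_frames, entity_feats)
prompt = build_prompt(detected)
```

With this toy data only `entity_a` clears the threshold, so only it reaches the prompt. Filtering the entity list before prompting keeps the prompt short even when the user-defined list is long, which is presumably one motivation for spotting keywords first rather than prompting with every entity.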

Paper Structure

This paper contains 15 sections, 1 equation, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: (a) Entity features are extracted using TTS followed by the Whisper encoder. (b) Flowchart of the KWS-Whisper model. (c) Illustration of the OV-KWS module.
  • Figure 2: The weights for Whisper encoder layers.
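
The caption of Figure 2 mentions weights for Whisper encoder layers. One common way such per-layer weights arise is a softmax-weighted sum of the hidden states from every encoder layer; whether the paper uses exactly this formulation is an assumption, and the names below (`combine_layers`, `logits`) are hypothetical. A minimal sketch of that idea:

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_states, logits):
    """Softmax-weighted sum of per-layer hidden states.

    layer_states: list of L layers, each a [T][D] nested list of floats.
    logits: L scalars (learnable in a real model, fixed here).
    Returns a single [T][D] nested list.
    """
    weights = softmax(logits)
    num_frames = len(layer_states[0])
    dim = len(layer_states[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, layer_states))
            for d in range(dim)
        ]
        for t in range(num_frames)
    ]


# Two toy "layers", one frame each, two feature dimensions (made-up values).
layers = [[[0.0, 2.0]], [[2.0, 4.0]]]
combined = combine_layers(layers, logits=[0.0, 0.0])
```

With equal logits the result is the plain average of the layers; after training, the learned weights would indicate which layers the downstream module relies on most.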