Keyword-Guided Adaptation of Automatic Speech Recognition

Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

TL;DR

This work addresses two persistent weaknesses of automatic speech recognition, domain-specific jargon and noisy environments, by introducing keyword-guided adaptation of Whisper. An open-vocabulary keyword spotting (KWS) module detects domain keywords and generates prompts that bias the Whisper decoder toward those terms. Two variants are explored: KG-Whisper, which fine-tunes the decoder, and KG-Whisper-PT, which learns a prompt prefix while keeping all Whisper parameters frozen. During training, KWS predictions are simulated by sampling positive and negative keywords and inserting them into the decoding context; the KWS module itself stays frozen. Across multilingual and out-of-domain datasets, the proposed methods consistently outperform Whisper baselines in both WER and keyword F1, including an average WER improvement of about 5.1% on unseen languages. The results indicate that keyword-guided decoding is an effective route to domain adaptation for ASR and may extend to other encoder-decoder speech models.
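The training strategy described above (sampling positive and negative keywords to stand in for KWS predictions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_keyword_prompt`, the whitespace tokenization, the sample sizes, and the comma-separated prompt format are all assumptions made for illustration, and the step that feeds the prompt into the Whisper decoding context is not shown.

```python
import random


def build_keyword_prompt(transcript, distractor_pool, n_pos=2, n_neg=2, rng=None):
    """Simulate KWS output for one training example (hypothetical helper).

    Positives are words that occur in the reference transcript; negatives are
    distractors that do not. Both are shuffled together so the decoder cannot
    rely on keyword position to tell them apart.
    """
    rng = rng or random.Random()

    # Assumption: naive lowercase whitespace tokenization of the transcript.
    words = sorted(set(transcript.lower().split()))
    positives = rng.sample(words, min(n_pos, len(words)))

    # Negatives must not appear in the transcript.
    candidates = [w for w in distractor_pool if w.lower() not in words]
    negatives = rng.sample(candidates, min(n_neg, len(candidates)))

    keywords = positives + negatives
    rng.shuffle(keywords)

    # Assumption: the prompt is a comma-separated keyword list.
    return ", ".join(keywords), positives, negatives


prompt, positives, negatives = build_keyword_prompt(
    "the patient shows tachycardia and hypoxia",
    distractor_pool=["stenosis", "edema", "patient"],
    rng=random.Random(0),
)
print(prompt)
```

Including negatives matters because a real KWS module produces false alarms; a decoder trained only on correct keywords could learn to copy every prompted term into the transcript.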

Abstract

Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and with specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextually biasing Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts that guide the decoder during transcription. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which fine-tunes the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and a reduction in overall word error rate (WER). Specifically, in unseen language generalization, we demonstrate an average WER improvement of 5.1% over Whisper.

Paper Structure (10 sections, 4 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: An illustration of (a) KG-Whisper: the decoder receives keywords during fine-tuning, while the encoder module remains frozen; and (b) KG-Whisper-PT: all Whisper model parameters are frozen and only a small number of prompt tokens are tuned.
  • Figure 2: Visualization of the cross-attention weights for KG-Whisper and KG-Whisper-PT. We illustrate how the model's attention is directed towards the identified keywords (Y-axis) as it predicts them within the transcribed text (X-axis).