Keyword-Guided Adaptation of Automatic Speech Recognition
Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet
TL;DR
This work tackles the challenge of recognizing domain-specific jargon and handling noisy environments in automatic speech recognition by introducing keyword-guided adaptation of Whisper. It uses an open-vocabulary keyword spotting (KWS) module to detect domain keywords and generate prompts that bias the Whisper decoder toward those terms. Two variants are explored: KG-Whisper, which fine-tunes the decoder, and KG-Whisper-PT, which learns a prompt prefix while keeping all Whisper parameters frozen. During training, KWS predictions are simulated by sampling positive and negative keywords and inserting them into the decoding context, while the KWS module itself stays frozen. Across multilingual and out-of-domain datasets, the proposed methods consistently outperform Whisper baselines in WER and keyword F1, including an average WER improvement of about 5.1% on unseen languages. The results indicate robust, scalable domain adaptation for ASR and suggest that keyword-guided decoding could be applied to other encoder-decoder speech models.
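The training strategy above (simulating KWS output by mixing positive and negative keywords into the decoding context) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function name `build_keyword_prompt`, the parameters `p_pos` and `n_neg`, the comma-separated prompt format, and the restriction to single-word keywords.

```python
import random


def build_keyword_prompt(transcript, keyword_pool, n_neg=2, p_pos=0.8, seed=None):
    """Simulate a KWS prediction for one training utterance (hypothetical helper).

    Positives are pool keywords that occur in the reference transcript;
    negatives are pool keywords that do not. Single-word keywords only.
    """
    rng = random.Random(seed)
    words = set(transcript.lower().split())

    positives = [k for k in keyword_pool if k.lower() in words]
    negatives = [k for k in keyword_pool if k.lower() not in words]

    # Keep each positive with probability p_pos to mimic KWS misses.
    kept = [k for k in positives if rng.random() < p_pos]
    # Add a few negatives to mimic KWS false alarms.
    sampled_neg = rng.sample(negatives, min(n_neg, len(negatives)))

    keywords = kept + sampled_neg
    rng.shuffle(keywords)  # avoid leaking which keywords are the true ones
    # Assumed prompt format: a comma-separated list placed in the decoder context.
    return ", ".join(keywords)


prompt = build_keyword_prompt(
    "the stent was placed via catheter",
    ["stent", "catheter", "aorta", "angioplasty"],
    n_neg=1,
    seed=0,
)
print(prompt)
```

Including negatives in the simulated prompt is what would let the decoder learn to disregard keywords that the audio does not support, rather than copying every prompted term into the transcript.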
Abstract
Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and with specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextually biasing Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts that guide the decoder during transcription. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which fine-tunes the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and a reduction in the overall word error rate (WER). Specifically, in unseen language generalization, we demonstrate an average WER improvement of 5.1% over Whisper.
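For readers unfamiliar with the two metrics reported here, the following sketch shows one common way to compute them. WER is the standard word-level edit distance divided by the reference length. The keyword F1 definition below (presence of each keyword in reference versus hypothesis) is an assumption for illustration, since the text above does not define it, and the function names are hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def keyword_f1(reference, hypothesis, keywords):
    """F1 over single-word keywords, scored by presence (assumed definition)."""
    ref_words, hyp_words = set(reference.split()), set(hypothesis.split())
    tp = sum(1 for k in keywords if k in ref_words and k in hyp_words)
    fp = sum(1 for k in keywords if k not in ref_words and k in hyp_words)
    fn = sum(1 for k in keywords if k in ref_words and k not in hyp_words)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


ref = "the stent was placed via catheter"
hyp = "the stand was placed via catheter"
print(wer(ref, hyp))
print(keyword_f1(ref, hyp, ["stent", "catheter"]))
```

In this example a single substituted jargon word changes WER only slightly but removes half of the keyword recall, which is why the two metrics are reported separately.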
