Improving Rare-Word Recognition of Whisper in Zero-Shot Settings

Yash Jogi, Vaibhav Aggarwal, Shabari S Nair, Yash Verma, Aayush Kubba

TL;DR

This work tackles the persistent challenge of rare-word recognition in Whisper, a large open-source ASR model trained on about 680K hours of data. The authors propose a data-efficient, instruction-tuned approach to contextual biasing (B-Whisper): they fine-tune Whisper-Large on roughly 670 hours of Common Voice English, using a novel prompt-crafting strategy and a weighted cross-entropy loss $L = \sum_i w_i H(q_i, p_i)$, where $w_i = \beta$ for true-bias tokens and $w_i = 1$ otherwise, with $\beta = 1.1$. In zero-shot evaluations across 11 English datasets, the method yields average relative improvements of $45.6\%$ in R-WER and $60.8\%$ in OOV-WER over the baseline prompting approach, and its biasing ability also generalizes zero-shot to unseen languages. These results suggest that combining Whisper’s pre-training with targeted instruction-tuning generalizes robustly, offering a practical path to better domain-specific term recognition in resource-constrained settings.
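
The weighted cross-entropy loss from the TL;DR can be illustrated with a minimal, dependency-free sketch. It assumes one-hot reference targets, so each per-token term $H(q_i, p_i)$ reduces to the negative log-probability of the reference token. The function name, the boolean mask argument, and the toy numbers are hypothetical and chosen only for illustration. The summary does not say whether the loss is averaged over tokens, so the sketch keeps the plain sum.

```python
import math

def weighted_cross_entropy(probs, targets, is_bias_token, beta=1.1):
    """Sketch of L = sum_i w_i * H(q_i, p_i) with one-hot targets q_i.

    probs: per-position predicted distributions p_i (each sums to 1)
    targets: reference token ids, one per position
    is_bias_token: True where the reference token belongs to a true-bias word
    beta: weight for true-bias tokens (1.1 in the summary); others get 1
    """
    total = 0.0
    for p_i, target, biased in zip(probs, targets, is_bias_token):
        w_i = beta if biased else 1.0
        # With a one-hot q_i, H(q_i, p_i) is -log p_i[target].
        total += w_i * -math.log(p_i[target])
    return total

# Toy example (hypothetical numbers): 3-token vocabulary, 2 positions.
# Position 0 is treated as part of a true-bias word, position 1 is not.
probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
targets = [0, 1]
mask = [True, False]

loss = weighted_cross_entropy(probs, targets, mask)
unweighted = weighted_cross_entropy(probs, targets, mask, beta=1.0)
```

With $\beta > 1$, errors on true-bias tokens cost slightly more than errors elsewhere, so `loss` exceeds `unweighted` whenever a bias token is predicted with probability below 1. In a real training setup the same weighting would usually be applied to per-token losses from a deep-learning framework rather than to Python lists.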

Abstract

Whisper, despite being trained on 680K hours of web-scaled audio data, faces difficulty in recognising rare words like domain-specific terms, with a solution being contextual biasing through prompting. To improve upon this method, in this paper, we propose a supervised learning strategy to fine-tune Whisper for contextual biasing instruction. We demonstrate that by using only 670 hours of Common Voice English set for fine-tuning, our model generalises to 11 diverse open-source English datasets, achieving a 45.6% improvement in recognition of rare words and 60.8% improvement in recognition of words unseen during fine-tuning over the baseline method. Surprisingly, our model's contextual biasing ability generalises even to languages unseen during fine-tuning.

Paper Structure

This paper contains 8 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of B-Whisper. This figure depicts the train-time inputs and outputs for B-Whisper with an example. In the bottom-most sequence of squares, colored squares contain task-specifiers and transcript tokens, whereas colorless squares contain prompt tokens. In this case, the word "spirometry" is the true-bias word for the reference transcript "spirometry measures lung function accurately". As such, the tokens of "spirometry" are given weight $\beta > 1$ and the rest of the tokens are given weight $1$ in Cross Entropy Loss, as shown in the top-most sequence of squares.
  • Figure 2: Average values across 11 datasets of (a) WER and U-WER, (b) R-WER, (c) WER, for biasing list sizes $35$, $70$ and $150$ for Whisper + P and B-Whisper + P. Here, in case of (c), the biasing list contains only false-bias words.