kNN For Whisper And Its Effect On Bias And Speaker Adaptation

Maya K. Nachesa, Vlad Niculae

TL;DR

The paper addresses uneven ASR performance across languages, domains, and speakers, and proposes a token-level $k$NN augmentation that lets Whisper adapt at inference time without fine-tuning. It extends the non-parametric datastore approach to speech: the datastore holds hidden states and their tokens, and the $k$NN-derived distribution is interpolated with the base model's distribution via $p(y) = \lambda\, p_{k\mathrm{NN}}(y) + (1-\lambda)\, p_{\mathrm{model}}(y)$, where $p_{k\mathrm{NN}}(y) = \sum_i \mathbb{1}_{y=v_i}\, p(k_i)$. Evaluations on VoxPopuli, LibriSpeech, CommonVoice NL, and RixVox show dataset-dependent gains: improvements are larger on the datasets other than LibriSpeech, and speaker adaptation benefits from larger or more comprehensive datastores, which must be balanced against decoding speed. The work demonstrates a practical, non-parametric route to improving a large transformer-based ASR model, while exposing nuanced effects on bias across gender, accent, and age groups that highlight both opportunities and limitations for real-world deployment.
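
The interpolation in the TL;DR can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that $p(k_i)$ is a softmax over negative neighbor distances scaled by a temperature $T$, as is common in $k$NN-augmented decoders; the summary itself does not define $p(k_i)$. The function names `knn_distribution` and `interpolate`, and the toy numbers, are hypothetical.

```python
import numpy as np

def knn_distribution(neighbor_tokens, neighbor_dists, vocab_size, temperature=1.0):
    """p_kNN(y) = sum_i 1[y = v_i] * p(k_i).

    ASSUMPTION: p(k_i) is a softmax over -distance / temperature.
    The summary does not specify how p(k_i) is computed.
    """
    neighbor_tokens = np.asarray(neighbor_tokens)
    logits = -np.asarray(neighbor_dists, dtype=np.float64) / temperature
    logits -= logits.max()                 # subtract max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()               # weights now sum to 1 over the k neighbors
    p_knn = np.zeros(vocab_size)
    # Accumulate weights per token; repeated tokens add up.
    np.add.at(p_knn, neighbor_tokens, weights)
    return p_knn

def interpolate(p_model, p_knn, lam):
    """p(y) = lam * p_kNN(y) + (1 - lam) * p_model(y)."""
    return lam * p_knn + (1.0 - lam) * np.asarray(p_model)

# Toy example: a 4-token vocabulary and k = 3 retrieved neighbors.
p_model = np.array([0.7, 0.2, 0.1, 0.0])
p_knn = knn_distribution(
    neighbor_tokens=[1, 1, 2],             # tokens v_i stored with the neighbors
    neighbor_dists=[0.5, 1.0, 4.0],        # distances from the query to keys k_i
    vocab_size=4,
    temperature=1.0,
)
p = interpolate(p_model, p_knn, lam=0.5)
```

With $\lambda = 0$ the result is the base model's distribution unchanged, and with $\lambda = 1$ only the retrieved neighbors decide the next token.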

Abstract

Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts using inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
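
To make the idea of inference-time search in an external datastore concrete, the following is a minimal sketch of building and querying such a datastore with exact search in NumPy. It is an illustration under assumptions rather than the paper's setup: the hidden states are random stand-ins for decoder states, the distance is squared L2, and `build_datastore` and `search` are hypothetical helper names. Large datastores would typically use an approximate nearest-neighbor index instead of this brute-force scan.

```python
import numpy as np

def build_datastore(hidden_states, target_tokens):
    """Stack per-utterance (hidden state, next token) pairs into one datastore.

    hidden_states: list of arrays, each of shape (num_tokens, dim)
    target_tokens: list of token-id sequences, aligned with hidden_states
    """
    keys = np.concatenate(
        [np.asarray(h, dtype=np.float32) for h in hidden_states], axis=0
    )
    values = np.concatenate(
        [np.asarray(t, dtype=np.int64) for t in target_tokens], axis=0
    )
    assert keys.shape[0] == values.shape[0], "one token per stored hidden state"
    return keys, values

def search(keys, values, query, k):
    """Exact k-nearest-neighbor search (ASSUMPTION: squared L2 distance)."""
    dists = np.sum((keys - query) ** 2, axis=1)
    k = min(k, len(dists))
    idx = np.argsort(dists)[:k]            # indices of the k closest keys
    return values[idx], dists[idx]

# Toy data: two "utterances" with random stand-in hidden states of dimension 8.
rng = np.random.default_rng(0)
utt_states = [rng.normal(size=(3, 8)), rng.normal(size=(5, 8))]
utt_tokens = [[11, 12, 13], [21, 22, 23, 24, 25]]

keys, values = build_datastore(utt_states, utt_tokens)

# A query close to the stored state at index 4 (token 22 of the second utterance).
query = keys[4] + 0.01
tokens, dists = search(keys, values, query, k=3)
```

The model's weights are never touched: adapting to a new domain or speaker only means adding entries to `keys` and `values`.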

Paper Structure

This paper contains 24 sections, 3 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: WERs on VoxPopuli.en dev with Whisper large-v3 for different values of $k$ (line color), $T$ (line width), and $\lambda$ (x-axis).
  • Figure 2: WER per age group for CommonVoice 18.0 NL using Whisper large-v3. The horizontal dashed lines represent the overall results without and with $k$NN. The numbers in the graph indicate the bin count.