kNN For Whisper And Its Effect On Bias And Speaker Adaptation
Maya K. Nachesa, Vlad Niculae
TL;DR
The paper addresses uneven ASR performance across languages, domains, and speakers, and proposes a token-level $k$NN augmentation for Whisper that adapts at inference time without fine-tuning. It extends the non-parametric datastore approach to speech: the datastore holds decoder hidden states as keys and the corresponding tokens as values, and the $k$NN-derived distribution is interpolated with the base model's distribution via $p(y) = \lambda p_{kNN}(y) + (1-\lambda) p_{model}(y)$, where $p_{kNN}(y) = \sum_i 1_{y=v_i} p(k_i)$ and $v_i$ is the token stored with the $i$-th retrieved key $k_i$. Evaluations on VoxPopuli, LibriSpeech, CommonVoice NL, and RixVox show dataset-dependent gains, with larger improvements on the datasets other than LibriSpeech and, for speaker adaptation, with larger or more comprehensive datastores, which must be traded off against decoding speed. The work demonstrates a practical, non-parametric route to improving a large transformer-based ASR model, while exposing nuanced effects on bias across gender, accent, and age groups and highlighting both opportunities and limitations for real-world deployment.
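The interpolation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the squared-L2 distance, the softmax with a `temperature` used to obtain the neighbour weights $p(k_i)$, the brute-force search, and the names `knn_distribution` and `interpolate` are all assumptions made for illustration, and the toy datastore stands in for hidden states extracted from Whisper.

```python
import numpy as np


def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Build p_kNN(y) for one decoding step from a toy datastore.

    keys:   (N, d) array of stored hidden states
    values: (N,) array of the token ids stored with each key
    """
    # Assumed metric: squared L2 distance from the query to every key.
    dists = np.sum((keys - query) ** 2, axis=1)
    k = min(k, len(keys))
    nearest = np.argsort(dists)[:k]

    # Assumed neighbour weights p(k_i): softmax over negative distances.
    logits = -dists[nearest] / temperature
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()

    # p_kNN(y) = sum_i 1[y = v_i] * p(k_i): sum weights per stored token.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)
    return p_knn


def interpolate(p_model, p_knn, lam):
    """p(y) = lam * p_kNN(y) + (1 - lam) * p_model(y)."""
    return lam * p_knn + (1.0 - lam) * p_model


# Toy example: 3 stored states, vocabulary of 3 tokens.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
values = np.array([2, 2, 0])
query = np.array([0.0, 0.0])

p_knn = knn_distribution(query, keys, values, vocab_size=3, k=2)
p_model = np.array([0.5, 0.3, 0.2])
p = interpolate(p_model, p_knn, lam=0.25)
```

Here both retrieved neighbours carry token 2, so all of the $k$NN mass lands on that token and the interpolation shifts probability toward it in proportion to $\lambda$. A real datastore would typically need an approximate nearest-neighbour index rather than the exhaustive search used in this sketch.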
Abstract
Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts using inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
