kNN For Whisper And Its Effect On Bias And Speaker Adaptation

Maya K. Nachesa; Vlad Niculae

TL;DR

The paper addresses uneven ASR performance across languages, domains, and speakers, and proposes a token-level $k$NN augmentation for Whisper that adapts at inference time without fine-tuning. It extends the non-parametric datastore approach to speech: the datastore holds hidden states as keys and tokens as values, and the $k$NN-derived distribution is interpolated with the base model's distribution via $p(y) = \lambda\, p_{k\text{NN}}(y) + (1-\lambda)\, p_{\text{model}}(y)$, where $p_{k\text{NN}}(y) = \sum_i \mathbb{1}[y = v_i]\, p(k_i)$ sums the weights $p(k_i)$ of the retrieved keys $k_i$ whose stored token $v_i$ equals $y$. Evaluations on VoxPopuli, LibriSpeech, CommonVoice NL, and RixVox show dataset-dependent gains, with larger improvements on the non-LibriSpeech datasets and, for speaker adaptation, with larger or more comprehensive datastores, a benefit that has to be weighed against decoding speed. The work demonstrates a practical, non-parametric route to improving a large transformer-based ASR model, while exposing nuanced effects on bias across gender, accent, and age groups and highlighting both opportunities and limitations for real-world deployment.
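The interpolation in the TL;DR can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the toy numbers, and the choice of $p(k_i)$ as a softmax over negative distances with a `temperature` parameter (the convention commonly used in $k$NN language models) are assumptions made for illustration.

```python
import math

def knn_distribution(distances, values, vocab_size, temperature=1.0):
    """p_kNN(y) = sum_i 1[y == v_i] * p(k_i).

    Assumption: p(k_i) is a softmax over negative distances; the summary
    above does not spell out how the neighbor weights are computed.
    """
    scores = [-d / temperature for d in distances]
    top = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    p_knn = [0.0] * vocab_size
    for w, v in zip(weights, values):
        p_knn[v] += w / total              # neighbors with the same token pool their mass
    return p_knn

def interpolate(p_model, p_knn, lam):
    """p(y) = lam * p_kNN(y) + (1 - lam) * p_model(y)."""
    return [lam * pk + (1.0 - lam) * pm for pk, pm in zip(p_knn, p_model)]

# Toy example with a 3-token vocabulary (all numbers are made up).
p_model = [0.7, 0.2, 0.1]      # base model prefers token 0
distances = [0.5, 0.5, 2.0]    # distances of three retrieved neighbors
values = [1, 1, 0]             # tokens stored with those neighbors
p_knn = knn_distribution(distances, values, vocab_size=3)
p = interpolate(p_model, p_knn, lam=0.5)
```

In this toy case two close neighbors vote for token 1, so the interpolated distribution prefers token 1 even though the base model alone prefers token 0. Setting `lam=0` recovers the base model unchanged.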

Abstract

Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts using inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
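The datastore search that the abstract refers to can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `build_datastore`, `squared_l2`, and `search` are hypothetical helpers, the vectors are toy 2-dimensional stand-ins for decoder hidden states, and exhaustive squared-L2 search replaces whatever index a real system would use.

```python
import heapq

def build_datastore(examples):
    """Flatten (hidden_states, tokens) pairs into parallel key/value lists.

    Assumption: hidden_states[t] is the decoder state the model used when
    predicting tokens[t]; which layer to take is not specified here.
    """
    keys, values = [], []
    for hidden_states, tokens in examples:
        for h, tok in zip(hidden_states, tokens):
            keys.append(tuple(h))
            values.append(tok)
    return keys, values

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(keys, values, query, k):
    """Return the k nearest (distance, token) pairs, closest first (brute force)."""
    scored = ((squared_l2(key, query), val) for key, val in zip(keys, values))
    return heapq.nsmallest(k, scored)

# Two toy "utterances", each a list of hidden states with their target tokens.
examples = [
    ([[0.0, 0.0], [1.0, 0.0]], [5, 7]),
    ([[0.9, 0.1], [3.0, 3.0]], [7, 2]),
]
keys, values = build_datastore(examples)
nearest = search(keys, values, query=[1.0, 0.1], k=2)
```

No model weights change at any point: adapting to a new domain or speaker amounts to adding entries to `keys` and `values`. The cost is the extra search at every decoding step, which grows with the size of the datastore.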
