Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR
Shaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, ZongYao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang
TL;DR
This work tackles end-to-end ASR under speaker mismatch with scarce adaptation data by introducing Speaker-Smoothed kNN, a token-level retrieval approach that augments a pre-trained ASR model during decoding. Using x-vector embeddings, the method dynamically adjusts the kNN interpolation parameters $T_s$ and $\lambda_s$, enabling on-the-fly adaptation and robust handling of sparse, speaker-specific data. Empirical results on KeSpeech and MagicData show that the approach performs comparably to fine-tuning in in-domain settings and achieves state-of-the-art CER in all-domain settings, including scenarios with speaker changes. The method also has a practical inference cost and offers a pathway to efficient, data-efficient adaptation in real-world multilingual ASR systems.
Abstract
Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, performance can degrade when vocal characteristics differ between training and testing data, particularly when little adaptation data is available for the target speaker. We propose a novel speaker adaptation approach, Speaker-Smoothed kNN, which uses k-Nearest Neighbors (kNN) retrieval to improve model output by retrieving correctly pronounced tokens from a pre-built datastore during decoding. Moreover, we use x-vectors to dynamically adjust the kNN interpolation parameters to mitigate data sparsity. We validate the approach on the KeSpeech and MagicData corpora under in-domain and all-domain settings. Our method consistently performs comparably to fine-tuning, without the performance degradation that fine-tuning suffers during speaker changes. Furthermore, in the all-domain setting, our method achieves state-of-the-art results, reducing the CER in both single-speaker and multi-speaker test scenarios.
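To make the role of the two interpolation parameters concrete, the following is a minimal sketch of token-level kNN interpolation at one decoding step. It assumes the common kNN-LM-style formulation (a softmax over negative neighbor distances with a temperature, blended linearly with the model distribution); the abstract does not give the exact formula, so this may differ from the paper's method. The function name `knn_interpolate`, the toy tokens, and the distances are hypothetical. How $T_s$ and $\lambda_s$ are derived from x-vectors is not specified here, so they are simply passed in as `temperature` and `lam`.

```python
import math


def knn_interpolate(model_probs, neighbors, temperature, lam):
    """Blend model token probabilities with a kNN distribution (illustrative).

    model_probs: dict mapping token -> probability from the ASR decoder.
    neighbors:   list of (token, distance) pairs retrieved from the datastore.
    temperature: softmax temperature over negative distances (T_s in the text).
    lam:         interpolation weight in [0, 1] (lambda_s in the text).
    """
    if not neighbors:
        # Nothing retrieved: fall back to the unmodified model distribution.
        return dict(model_probs)

    # Softmax over negative distances, shifted by the minimum for stability.
    min_d = min(d for _, d in neighbors)
    weights = [(tok, math.exp(-(d - min_d) / temperature)) for tok, d in neighbors]
    total = sum(w for _, w in weights)

    # Aggregate the probability mass of neighbors that share a token.
    knn_probs = {}
    for tok, w in weights:
        knn_probs[tok] = knn_probs.get(tok, 0.0) + w / total

    # Linear interpolation between the kNN and model distributions.
    vocab = set(model_probs) | set(knn_probs)
    return {
        t: lam * knn_probs.get(t, 0.0) + (1.0 - lam) * model_probs.get(t, 0.0)
        for t in vocab
    }


# Toy example with made-up values: the model prefers "shi", but the
# speaker's datastore entries nearest to the current state are mostly "si".
model_probs = {"shi": 0.6, "si": 0.4}
neighbors = [("si", 1.0), ("si", 1.5), ("shi", 4.0)]
mixed = knn_interpolate(model_probs, neighbors, temperature=1.0, lam=0.5)
best = max(mixed, key=mixed.get)
```

In this sketch, a larger `lam` trusts the retrieved neighbors more, and a larger `temperature` flattens the kNN distribution across neighbors. Setting the two per speaker is, by the abstract's account, how the method copes with sparse speaker-specific datastores.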
