Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR

Shaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, ZongYao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang

TL;DR

This work tackles end-to-end ASR under speaker mismatch with scarce adaptation data by introducing Speaker-Smoothed kNN, a token-level retrieval approach that augments a pre-trained ASR model during decoding. By incorporating x-vector embeddings, the method dynamically adjusts the kNN interpolation parameters $T_s$ and $\lambda_s$, enabling on-the-fly adaptation and robust handling of sparse, speaker-specific data. Empirical results on KeSpeech and MagicData show the approach can match fine-tuning in in-domain settings and achieve state-of-the-art CER reductions in all-domain scenarios, including scenarios with speaker changes. The method also demonstrates a practical inference cost and offers a pathway for efficient, data-rich adaptation in real-world multilingual ASR systems.

Abstract

Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, performance can degrade when vocal characteristics differ between training and testing data, particularly when little adaptation data is available for the target speaker. We propose a novel speaker adaptation approach, Speaker-Smoothed kNN, that uses k-Nearest Neighbors (kNN) retrieval to improve the model output by finding correctly pronounced tokens in a pre-built datastore during decoding. Moreover, we use x-vectors to dynamically adjust the kNN interpolation parameters to address data sparsity. The approach was validated on the KeSpeech and MagicData corpora under in-domain and all-domain settings. Our method consistently performs comparably to fine-tuning, without the performance degradation that fine-tuning suffers when the speaker changes. Furthermore, in the all-domain setting, our method achieves state-of-the-art results, reducing the CER in both single-speaker and multi-speaker test scenarios.
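
To make the retrieval step concrete, here is a minimal sketch of token-level kNN interpolation as it is commonly formulated in kNN-augmented decoding: retrieved neighbors are weighted by exp(-distance / T), and the resulting distribution is mixed with the model distribution using a weight lambda. This is an illustration under assumptions, not the authors' implementation. The function names, the toy distances, and the dictionary-based distributions are all hypothetical. The rule by which x-vectors set the per-speaker values $T_s$ and $\lambda_s$ is not given in this section, so both are plain inputs here.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature):
    """Turn retrieved (token, distance) pairs into a token distribution.

    Each neighbor gets weight exp(-distance / temperature); weights of
    neighbors that share a token are summed, then normalized.
    """
    # Shift by the smallest distance for numerical stability
    # (the shift cancels out after normalization).
    d_min = min(distance for _, distance in neighbors)
    weights = defaultdict(float)
    for token, distance in neighbors:
        weights[token] += math.exp(-(distance - d_min) / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(model_probs, knn_probs, lam):
    """Mix the ASR model distribution with the kNN distribution.

    lam = 0 keeps the model output unchanged; lam = 1 trusts retrieval only.
    """
    tokens = set(model_probs) | set(knn_probs)
    return {
        t: lam * knn_probs.get(t, 0.0) + (1.0 - lam) * model_probs.get(t, 0.0)
        for t in tokens
    }


# Toy decoding step (all values are made up for illustration).
model_probs = {"a": 0.6, "b": 0.3, "c": 0.1}
neighbors = [("b", 1.0), ("b", 1.5), ("a", 4.0)]  # (token, distance)

# In the paper these two values are adjusted per speaker from x-vectors;
# here they are fixed constants (assumption).
knn_probs = knn_distribution(neighbors, temperature=1.0)
mixed = interpolate(model_probs, knn_probs, lam=0.5)
best_token = max(mixed, key=mixed.get)
```

In this toy step the model alone prefers "a", while the retrieved neighbors pull the mixed distribution toward "b". A larger temperature flattens the neighbor weights, and a smaller lambda keeps the output closer to the unadapted model, which is the kind of control a per-speaker adjustment would exercise when the datastore for a speaker is sparse.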

Paper Structure (13 sections, 6 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: t-SNE visualization of the kNN datastore representations. Panels A.x show sample representations from the same subdialect; panels B.x show samples from different subdialects. Dots of different colors indicate different speaker or token clusters.
  • Figure 2: An overview of our proposed Speaker-Smoothed kNN framework.
  • Figure 3: Left: CER trend of the different methods for different values of top k. Right: CER trend for different sampled datastore sizes. Both experiments were run on the multi-speaker test of the all-domain setting.