kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels

Jiaming Zhou, Shiwan Zhao, Yaqi Liu, Wenjia Zeng, Yong Chen, Yong Qin

TL;DR

The paper addresses the challenge of creating a fine-grained audio-text datastore for retrieval-augmented ASR by leveraging CTC pseudo labels to form frame-level key-value pairs, avoiding precise alignments. It introduces a skip-blank strategy to shrink the datastore and integrates a kNN retrieval mechanism with a CTC-based ASR system, combining $p_{kNN}(y|x)$ and $p_{CTC}(y|x)$ as $p(y|x)=\lambda p_{kNN}(y|x)+(1-\lambda)p_{CTC}(y|x)$. Empirical results show substantial improvements over vanilla CTC in both in-domain and cross-domain settings, with a pruned datastore delivering most of the gains while dramatically reducing storage requirements. The method enables faster unsupervised domain adaptation and demonstrates strong robustness across diverse Mandarin and Chinese dialect datasets, highlighting its practical impact for scalable, retrieval-augmented ASR.
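The interpolation above can be illustrated with a minimal sketch. The summary does not say how $p_{kNN}$ is computed from the retrieved neighbors, so the softmax over negative squared distances used here (in the style of kNN-LM) is an assumption, as are the function names `knn_distribution` and `interpolate` and the parameters `k` and `temperature`.

```python
import math


def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over labels.

    Assumption: neighbors are weighted by a softmax over negative squared
    Euclidean distance; the source does not specify the weighting.
    """
    if not keys:
        raise ValueError("datastore is empty")
    # Squared Euclidean distance from the query frame to every stored key.
    dists = [sum((q - x) ** 2 for q, x in zip(query, key)) for key in keys]
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    # Subtract the smallest distance before exponentiating for stability.
    d_min = min(dists[i] for i in nearest)
    weights = [math.exp(-(dists[i] - d_min) / temperature) for i in nearest]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for i, w in zip(nearest, weights):
        probs[values[i]] += w / total  # neighbors sharing a label accumulate
    return probs


def interpolate(p_knn, p_ctc, lam):
    """p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_CTC(y|x), per label."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_knn, p_ctc)]


# Toy datastore: 2-dimensional keys with integer label ids as values.
keys = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
values = [1, 2, 1]
p_knn = knn_distribution([0.0, 0.0], keys, values, vocab_size=3, k=2)
p_ctc = [0.2, 0.3, 0.5]  # made-up CTC posterior for one frame
p = interpolate(p_knn, p_ctc, lam=0.5)
```

Because both inputs are probability distributions and $\lambda \in [0, 1]$, the interpolated output is also a distribution; setting $\lambda = 0$ recovers the vanilla CTC posterior.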

Abstract

The success of retrieval-augmented language models in various natural language processing (NLP) tasks has been constrained in automatic speech recognition (ASR) applications due to challenges in constructing fine-grained audio-text datastores. This paper presents kNN-CTC, a novel approach that overcomes these challenges by leveraging Connectionist Temporal Classification (CTC) pseudo labels to establish frame-level audio-text key-value pairs, circumventing the need for precise ground truth alignments. We further introduce a skip-blank strategy, which strategically ignores CTC blank frames, to reduce datastore size. By incorporating a k-nearest neighbors retrieval mechanism into pre-trained CTC ASR systems and leveraging a fine-grained, pruned datastore, kNN-CTC consistently achieves substantial improvements in performance under various experimental settings. Our code is available at https://github.com/NKU-HLT/KNN-CTC.
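The datastore construction described in the abstract might look like the following sketch. It assumes that each key is a per-frame encoder representation and that the pseudo label is the argmax of the CTC posterior for that frame; the function name `build_datastore` and the `blank_id` default are hypothetical and not taken from the paper's code.

```python
def build_datastore(frame_embeddings, ctc_posteriors, blank_id=0, skip_blank=True):
    """Collect frame-level (key, value) pairs from CTC pseudo labels.

    Assumptions: keys are per-frame embeddings, values are the argmax label
    of the CTC posterior, and blank_id marks the CTC blank symbol.
    """
    keys, values = [], []
    for emb, post in zip(frame_embeddings, ctc_posteriors):
        # Pseudo label: the most probable CTC output for this frame.
        label = max(range(len(post)), key=lambda i: post[i])
        # Skip-blank: drop frames whose pseudo label is the blank symbol.
        if skip_blank and label == blank_id:
            continue
        keys.append(list(emb))
        values.append(label)
    return keys, values


# Four toy frames; the posteriors give pseudo labels [0, 2, 0, 1].
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
posteriors = [
    [0.9, 0.05, 0.05],
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
]
pruned_keys, pruned_values = build_datastore(embeddings, posteriors)
full_keys, full_values = build_datastore(embeddings, posteriors, skip_blank=False)
```

In this toy case the pruned datastore keeps two of the four frames. No ground truth alignment is needed, since every value comes from the model's own frame-level prediction.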

Paper Structure (13 sections, 4 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Overview of our $k$NN-CTC framework, which combines CTC and $k$NN models. The $k$NN model consists of two stages: datastore construction (blue dashed lines) and candidate retrieval (orange lines).
  • Figure 2: Location of keys
  • Figure 3: Effect of hyper-parameter $\lambda$ on DEV set