Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores

Jiaming Zhou, Shiwan Zhao, Hui Wang, Tian-Hao Zhang, Haoqin Sun, Xuechen Wang, Yong Qin

TL;DR

This work tackles zero-shot Chinese-English code-switching ASR by extending the kNN-CTC framework with dual monolingual datastores and a gated datastore selection mechanism. For each frame, the method selects the more relevant monolingual datastore and calibrates the resulting $P_{kNN}$ with language-aware adjustments, which injects language-specific information while suppressing cross-language noise. The method yields consistent mixed error rate (MER) improvements over bilingual datastore baselines for both Conformer and Wav2vec2-XLSR backbones, with notable reductions on the TEST and MIX evaluation sets. Together with the ablation analyses, the results demonstrate the practical value of retrieval-augmented CS-ASR and the importance of language-aware, gated datastore design.

Abstract

The kNN-CTC model has proven to be effective for monolingual automatic speech recognition (ASR). However, its direct application to multilingual scenarios such as code-switching presents challenges. Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. Our method selects the appropriate datastore for decoding each frame, ensuring the injection of language-specific information into the ASR process. We apply this framework to cutting-edge CTC-based models, developing an advanced CS-ASR system. Extensive experiments demonstrate the remarkable effectiveness of our gated datastore mechanism in enhancing the performance of zero-shot Chinese-English CS-ASR.

Paper Structure (11 sections, 9 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Overview of our methodology employing dual monolingual datastores, with the color blue representing Chinese and green representing English. For each audio frame, two retrieval operations are conducted to identify the appropriate datastore. Following this, the CTC distribution is interpolated with the $k$NN distribution from the selected language (e.g., $P^{EN}_{kNN}$ for English), while the CTC distribution corresponding to the unselected language (in this case, Chinese) is diminished.
  • Figure 2: Visualization of the average distances $d_{CN}$ and $d_{EN}$. The green dashed vertical lines mark occurrences of code-switching (CS). Blue represents Chinese, while orange represents English.
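
The per-frame procedure described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the gate is assumed to pick the datastore with the smaller average neighbor distance (the quantities $d_{CN}$ and $d_{EN}$ from Figure 2), and the function names, the interpolation weight `lam`, the damping factor `damp`, the softmax temperature, and the brute-force search are all hypothetical choices made for the example.

```python
import numpy as np

def knn_distribution(query, keys, labels, vocab_size, k=4, temperature=1.0):
    """Retrieve k neighbors from one datastore; return (P_kNN, mean distance)."""
    dists = np.sum((keys - query) ** 2, axis=1)   # squared L2 to every key
    k = min(k, len(dists))
    nearest = np.argsort(dists)[:k]
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())       # numerically stable softmax
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, labels[nearest], weights)    # sum weights per token label
    return p_knn, float(dists[nearest].mean())

def gated_frame_posterior(p_ctc, query, store_cn, store_en,
                          cn_ids, en_ids, lam=0.3, damp=0.5, k=4):
    """Combine the CTC posterior of one frame with the gated kNN distribution.

    store_cn / store_en are (keys, labels) pairs; cn_ids / en_ids are the
    vocabulary indices of Chinese and English tokens (all hypothetical names).
    """
    vocab_size = p_ctc.shape[0]
    keys_cn, labels_cn = store_cn
    keys_en, labels_en = store_en
    # Two retrievals per frame, one in each monolingual datastore.
    p_cn, d_cn = knn_distribution(query, keys_cn, labels_cn, vocab_size, k)
    p_en, d_en = knn_distribution(query, keys_en, labels_en, vocab_size, k)

    # Assumed gate: the datastore whose neighbors are closer on average wins.
    if d_cn <= d_en:
        p_knn, unselected = p_cn, cn_ids * 0 + en_ids
    else:
        p_knn, unselected = p_en, en_ids * 0 + cn_ids

    # Diminish CTC mass on the unselected language, then renormalize.
    adjusted = p_ctc.copy()
    adjusted[unselected] *= damp
    adjusted /= adjusted.sum()

    # Interpolate the adjusted CTC distribution with the selected P_kNN.
    return lam * p_knn + (1.0 - lam) * adjusted
```

With a toy five-token vocabulary (index 0 as blank, 1 and 2 as Chinese, 3 and 4 as English), a query frame that lies near the English keys selects the English datastore, so the combined distribution shifts mass toward the English tokens while still summing to one. A practical system would replace the brute-force distance computation with an approximate nearest-neighbor index.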