M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper

Jiaming Zhou, Shiwan Zhao, Jiabei He, Hui Wang, Wenjia Zeng, Yong Chen, Haoqin Sun, Aobo Kong, Yong Qin

TL;DR

This work tackles subdialect recognition in low-resource ASR by enhancing Whisper with retrieval augmentation that requires no parameter updates. It introduces M2R-Whisper, which combines sentence-level ICL-based pre-processing with token-level $k$NN-based post-processing, each backed by its own datastore (one sentence-level, one token-level) built from the training data. The final output distribution is the interpolation $\tilde{P}(y|x) = \lambda P_{kNN}(y|x) + (1-\lambda) P(y|x)$, where $P$ is Whisper's output distribution, $P_{kNN}$ is the distribution obtained from token-level retrieval, and $\lambda$ weights the two. This enables rapid domain adaptation and lowers the character error rate (CER) across Mandarin and its subdialects, with a CER as low as 4.11% on AISHELL-1 and average relative CER reductions of around 23–24%. The results indicate that multi-stage, multi-scale retrieval can substantially improve low-resource ASR without updating model parameters, and they show how sentence-level and token-level retrieval complement each other.
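
The interpolation in the TL;DR can be illustrated with a minimal Python sketch. Only the interpolation formula itself comes from the summary above. How $P_{kNN}$ is built from retrieved neighbors is an assumption here (a softmax over negative distances, aggregated per token, as is common in kNN-augmented decoding), and the names `knn_distribution`, `interpolate`, `temperature`, and `lam` are hypothetical.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (token, distance) pairs into a distribution over tokens.

    ASSUMPTION: softmax over negative distances, summed per token. The
    summary does not specify how P_kNN is computed.
    """
    if not neighbors:
        return {}
    # Shift by the smallest distance for numerical stability.
    d_min = min(dist for _, dist in neighbors)
    weights = defaultdict(float)
    for token, dist in neighbors:
        weights[token] += math.exp(-(dist - d_min) / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_model, p_knn, lam):
    """Compute lam * P_kNN(y|x) + (1 - lam) * P(y|x) for every token y."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if not p_knn:
        # No neighbors retrieved: fall back to the model distribution.
        return dict(p_model)
    vocab = set(p_model) | set(p_knn)
    return {
        token: lam * p_knn.get(token, 0.0) + (1.0 - lam) * p_model.get(token, 0.0)
        for token in vocab
    }


# Toy example: the model slightly prefers "a", the neighbors all vote for "b".
p_model = {"a": 0.6, "b": 0.4}
p_knn = knn_distribution([("b", 0.0), ("b", 0.0)])
p_final = interpolate(p_model, p_knn, lam=0.5)
```

With `lam = 0` the result is Whisper's own distribution, and with `lam = 1` it is the retrieval distribution alone, so `lam` controls how strongly the datastore can override the model.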

Abstract

State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-Whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, and integrates token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By combining sentence-level and token-level retrieval strategies, M2R-Whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.

Paper Structure

This paper contains 10 sections, 4 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: An overview of M2R-Whisper: a multi-stage and multi-scale retrieval augmentation framework. This framework enhances the Whisper ASR model by incorporating sentence-level retrieval in the pre-processing stage and token-level retrieval in the post-processing stage, with the goal of improving recognition accuracy, particularly in low-resource subdialect settings.
  • Figure 2: Illustration of our M2R-Whisper framework, which integrates multi-stage and multi-scale retrieval augmentation. The method consists of pre-processing with sentence-level ICL and post-processing with token-level $k$NN. Prior to testing, we construct separate sentence-level and token-level datastores from the training set. For each test audio, we retrieve the top $k$ most similar audio-text pairs as prompts, concatenating the prompt audio with the test audio. The corresponding prompt text is passed as a special token prefix to enhance ICL. During decoding, token-level retrieval augmentation is applied to generate the $k$NN distribution $P_{kNN}$, which is then interpolated with Whisper's output distribution $P$ to produce the final prediction.
  • Figure 3: CER (%) and real-time factor (RTF) for different maximum numbers of prompts on the Southwestern development set.
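
The sentence-level pre-processing described in the Figure 2 caption (retrieve the top $k$ most similar audio-text pairs, concatenate the prompt audio with the test audio, and pass the prompt text as a prefix) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the similarity measure (cosine), the source of the sentence embeddings, the prompt ordering, and the names `cosine`, `retrieve_prompts`, and `build_icl_input` are all hypothetical, and audio is represented as a plain list of samples.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_prompts(query_embedding, datastore, k):
    """Return the k datastore entries most similar to the query.

    ASSUMPTION: each entry is a dict with 'embedding', 'audio', and 'text'
    keys, and similarity is cosine; the caption only says "most similar".
    """
    ranked = sorted(
        datastore,
        key=lambda entry: cosine(query_embedding, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_icl_input(test_audio, prompts):
    """Concatenate prompt audio before the test audio and join prompt texts."""
    audio = []
    for prompt in prompts:
        audio.extend(prompt["audio"])
    audio.extend(test_audio)
    # ASSUMPTION: prompt texts are joined in retrieval order to form the prefix.
    text_prefix = "".join(prompt["text"] for prompt in prompts)
    return audio, text_prefix


# Toy sentence-level datastore with 2-dimensional embeddings.
datastore = [
    {"embedding": [1.0, 0.0], "audio": [0.1, 0.2], "text": "A"},
    {"embedding": [0.0, 1.0], "audio": [0.3], "text": "B"},
    {"embedding": [1.0, 1.0], "audio": [0.5], "text": "C"},
]
prompts = retrieve_prompts([1.0, 0.0], datastore, k=2)
audio, text_prefix = build_icl_input([0.9], prompts)
```

In a real system the concatenated audio and the text prefix would then be handed to Whisper for decoding, and the maximum number of prompts trades accuracy against decoding cost, which is the trade-off Figure 3 reports.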