Retrieval-Augmented Speech Recognition Approach for Domain Challenges
Peng Shen, Xugang Lu, Hisashi Kawai
TL;DR
This work tackles domain mismatch in automatic speech recognition by introducing a retrieval-augmented speech recognition framework that uses domain-specific textual content only during inference. The system integrates a content embedding database, domain-specific content retrieval, and an LLM-enhanced decoder guided by an instruction prompt to leverage retrieved text alongside audio features. A two-stage optimization trains the audio encoder first and then the LLM decoder with retrieved content, initialized from Whisper-large-v2 and ELYZA-Japanese-Llama-2-7b. On CSJ, the method achieves 3.7% CER on an out-of-domain test and 4.2% CER on the in-domain Eval1, with a 19.6% relative improvement over a strong baseline, demonstrating state-of-the-art performance with substantially less training data and preserving data privacy by avoiding domain data during training.
Abstract
Speech recognition systems often struggle with domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during training, our model is trained to use textual information provided in the prompt of the LLM decoder. Because the RAG retrieval mechanism can efficiently access locally available domain-specific documents, our approach offers a convenient and effective way to address domain mismatch. Experiments on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results, even without using the full training data.
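The inference-time flow described above (look up related domain text, then place it in the decoder prompt) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the character n-gram `embed` function stands in for a learned text embedder, the query is assumed to be a noisy first-pass transcript (the abstract does not say what is used as the retrieval query), and the prompt wording in `build_prompt` is hypothetical. Audio features, which the real decoder also consumes, are omitted.

```python
import math
from collections import Counter


def embed(text: str, n: int = 2) -> Counter:
    """Toy embedding: character n-gram counts (stand-in for a learned encoder)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(retrieved: list[str]) -> str:
    """Assemble a hypothetical instruction prompt from the retrieved text."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return (
        "Transcribe the speech. Use the related domain text below if it helps.\n"
        f"Related text:\n{context}\n"
        "Transcript:"
    )


# Hypothetical local domain documents (the "content embedding database").
documents = [
    "acoustic model adaptation for lecture speech",
    "recipe for tomato pasta sauce",
    "speaker adaptation of neural acoustic models",
]
first_pass = "acoustic modle adaptation"  # assumed noisy first-pass hypothesis
top = retrieve(first_pass, documents, k=2)
prompt = build_prompt(top)
print(prompt)
```

The point of the sketch is that the domain documents are touched only at inference time: swapping in a different `documents` list changes what the decoder sees without any retraining.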
