Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words

Kento Nozawa, Takashi Masuko, Toru Taniguchi

TL;DR

This work investigates context-aware ASR by prompting a decoder-only LLM with keywords to disambiguate rare and ambiguous words without changing model architectures. By pairing PLaMo-100B with Whisper as the audio encoder and using a linear adapter to align audio features with text embeddings, the approach enables effective keyword-driven transcription across Japanese and English data. Empirical results show notable improvements in CER for Japanese datasets and strong gains in keyword accuracy, with some degradation in English-only scenarios due to limited English data and dataset biases. The findings demonstrate the practical potential of prompt-based contextualization for improving ASR in domain-specific and language-diverse settings, while highlighting the importance of dataset diversity and bias mitigation in training.
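The summary above reports results as character error rate (CER) and keyword accuracy. As a rough illustration of what these metrics measure, the sketch below computes CER as the character-level edit distance divided by the reference length, which is the standard definition. The `keyword_hit_rate` helper is a hypothetical stand-in: the paper's exact definition of keyword accuracy is not given here, so simple substring matching is assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def keyword_hit_rate(keywords, hyp):
    """Assumed metric (not from the paper): share of keywords found verbatim in hyp."""
    if not keywords:
        raise ValueError("keywords must be non-empty")
    return sum(1 for k in keywords if k in hyp) / len(keywords)


print(cer("kitten", "sitting"))
print(keyword_hit_rate(["PLaMo", "Whisper"], "PLaMo with Wisper"))
```

CER suits Japanese because the text is not whitespace-delimited, so a word-level error rate would depend on the choice of tokenizer.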

Abstract

We develop a large language model (LLM) based automatic speech recognition (ASR) system that can be contextualized by providing keywords as prior information in text prompts. We adopt a decoder-only architecture and use our in-house LLM, PLaMo-100B, as the decoder; it was pre-trained from scratch on datasets dominated by Japanese and English text. A pre-trained Whisper encoder serves as the audio encoder. Its audio embeddings are projected into the text embedding space by an adapter layer and concatenated with the text embeddings of the prompt to form the decoder input. By providing keywords as prior information in the text prompt, we can contextualize the system, without modifying the model architecture, so that it accurately transcribes ambiguous words in the input audio. Experimental results demonstrate that providing keywords to the decoder can significantly improve the recognition performance of rare and ambiguous words.
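The data flow described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the prompt template in `build_prompt`, the character-level `toy_text_embed` (a stand-in for the real tokenizer and embedding table), the tiny dimensions, and the ordering of prompt embeddings before audio embeddings are all hypothetical. Only the overall shape (audio features, a linear adapter, concatenation with prompt embeddings) comes from the text.

```python
import random


def linear_adapter(audio_feats, weight, bias):
    """Project each audio frame into the text embedding space.

    weight holds one row per output dimension; each row has one entry
    per audio feature dimension.
    """
    return [
        [sum(x * w for x, w in zip(frame, row)) + b for row, b in zip(weight, bias)]
        for frame in audio_feats
    ]


def build_prompt(keywords):
    """Hypothetical prompt template; the paper's actual wording is not shown here."""
    if not keywords:
        return "Transcribe the audio."
    return "Keywords: " + ", ".join(keywords) + ". Transcribe the audio."


def toy_text_embed(text, d_text):
    """Stand-in for tokenizer + embedding lookup: one vector per character."""
    return [[((ord(ch) * (k + 1)) % 97) / 97.0 for k in range(d_text)] for ch in text]


def build_decoder_inputs(prompt_embeds, audio_embeds):
    """Concatenate along the sequence axis (prompt-first order is an assumption)."""
    return prompt_embeds + audio_embeds


rng = random.Random(0)
D_AUDIO, D_TEXT, N_FRAMES = 4, 6, 5  # toy sizes, far smaller than any real model
audio_feats = [[rng.gauss(0, 1) for _ in range(D_AUDIO)] for _ in range(N_FRAMES)]
weight = [[rng.gauss(0, 0.1) for _ in range(D_AUDIO)] for _ in range(D_TEXT)]
bias = [0.0] * D_TEXT

prompt = build_prompt(["PLaMo", "Whisper"])
inputs = build_decoder_inputs(
    toy_text_embed(prompt, D_TEXT),
    linear_adapter(audio_feats, weight, bias),
)
print(len(inputs), len(inputs[0]))  # sequence length, embedding width
```

Because the keywords enter only through the prompt text, changing them requires no change to the encoder, adapter, or decoder, which is the property the abstract emphasizes.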

Paper Structure (31 sections, 1 figure, 7 tables)