Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning
Yingyi Ma, Zhe Liu, Ozlem Kalinli
TL;DR
The paper tackles domain adaptation for LLM-based ASR, where adapting on text-only data creates a prompt mismatch: the decoder is normally prompted with audio embeddings, which text-only data lacks. It introduces a two-step soft-prompt fine-tuning method that learns a domain-specific soft prompt $S_{\nabla}$ to serve as a pseudo audio embedding. The prompt is first learned with the rest of the model frozen, and the decoder is then fine-tuned with this prompt in place. Empirical results on entity-heavy music and chatbot domains show consistent improvements in word error rate (WER) and entity error rate (EER) over baselines, with additional gains when combined with external LM fusion. The approach emphasizes aligning prompt length with the utterance characteristics of the domain, and it demonstrates robust domain knowledge transfer with promising scalability to multi-domain settings.
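The two-step recipe summarized above can be sketched in plain Python as a rough illustration, not the authors' implementation. The names `build_text_only_input` and `trainable_groups` are hypothetical, and three details are assumptions not stated in the summary: that the soft prompt is placed before the text tokens, that the prompt positions are excluded from the loss, and that the prompt stays frozen during the second step.

```python
from typing import List, Set, Tuple

Vector = List[float]


def build_text_only_input(
    soft_prompt: List[Vector], token_embeddings: List[Vector]
) -> Tuple[List[Vector], List[int]]:
    """Put the soft prompt where audio embeddings would normally sit.

    Returns the combined sequence and a loss mask in which 1 marks a
    position that contributes to the language-model loss (assumption:
    only text positions are scored, prompt positions are not).
    """
    if soft_prompt and token_embeddings:
        if len(soft_prompt[0]) != len(token_embeddings[0]):
            raise ValueError("soft prompt and tokens must share a dimension")
    sequence = soft_prompt + token_embeddings
    loss_mask = [0] * len(soft_prompt) + [1] * len(token_embeddings)
    return sequence, loss_mask


def trainable_groups(step: int) -> Set[str]:
    """Name the parameter group that is updated in each step.

    Step 1: learn only the soft prompt, everything else frozen.
    Step 2: fine-tune the decoder with the learned prompt in place
            (assumption: the prompt itself is frozen here).
    """
    if step == 1:
        return {"soft_prompt"}
    if step == 2:
        return {"decoder"}
    raise ValueError("step must be 1 or 2")


# Toy example: 3 prompt vectors and 2 token embeddings, all of dimension 2.
prompt = [[0.0, 0.0] for _ in range(3)]
text = [[0.1, 0.2], [0.3, 0.4]]
sequence, mask = build_text_only_input(prompt, text)
```

The number of prompt vectors (3 in this toy example) stands in for the prompt length, which the summary says should be matched to the utterance characteristics of the domain.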
Abstract
The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR). Prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art in ASR. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly enhance ASR performance on domain adaptation tasks. LLM-based ASR can naturally incorporate more text corpora by fine-tuning the LLM decoder, but fine-tuning such a system on text-only data without paired prompts may diminish the effectiveness of domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with our proposed method achieves up to a 9% relative Word Error Rate (WER) reduction and up to an 18% relative Entity Error Rate (EER) reduction on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion further improves the EER by a relative 2-5%.
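Because the reported gains are relative rather than absolute, a small helper makes the arithmetic explicit. This is a minimal sketch: the function name `relative_reduction` is hypothetical, and the error rates in the usage line are made-up illustrative values, not results from the paper.

```python
def relative_reduction(baseline: float, adapted: float) -> float:
    """Relative error-rate reduction: (baseline - adapted) / baseline.

    Both arguments must be in the same unit (for example, percent WER).
    A positive result means the adapted system makes fewer errors.
    """
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - adapted) / baseline


# Illustrative values only: a drop from 20.0 to 18.0 is a 2-point
# absolute reduction but a 10% relative reduction.
example = relative_reduction(20.0, 18.0)
```

The same formula applies to EER, with entity errors in place of word errors.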
