Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning

Yingyi Ma, Zhe Liu, Ozlem Kalinli

TL;DR

The paper tackles domain adaptation for LLM-based ASR by addressing the prompt mismatch that arises when adapting with text-only data. It introduces a two-step soft-prompt fine-tuning method that learns a domain-specific soft prompt $S$ to serve as a pseudo audio embedding: the prompt is first trained with the rest of the model frozen, and the decoder is then fine-tuned with this prompt. Empirical results on entity-heavy music and chatbot domains show consistent WER and EER improvements over baselines, with additional gains when combined with external LM fusion. The approach emphasizes aligning the prompt length with domain utterance characteristics and demonstrates robust domain knowledge transfer with promising scalability to multi-domain settings.

Abstract

The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR). Prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art in ASR. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly enhance ASR performance on domain adaptation tasks. LLM-based ASR can naturally incorporate more text corpora by fine-tuning the LLM decoder, but fine-tuning such a system on text-only data without paired prompts may diminish the effectiveness of domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with our proposed method achieves relative reductions of up to 9% in Word Error Rate (WER) and up to 18% in Entity Error Rate (EER) on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion can further improve the EER by a relative 2-5%.

Paper Structure

This paper contains 15 sections, 2 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: The training process of LLM-based ASR is illustrated in (a). Domain adaptation fine-tuning with a soft prompt as pseudo audio embedding is illustrated in (b) and (c). We first train the soft prompt $S\in \mathbb{R}^{d \times e}$ with all other components frozen, then fine-tune the LLM with the trained soft prompt for more effective text adaptation. Frozen components are shown in grey.
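
The two-step schedule described in the Figure 1 caption can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the "decoder" is a hypothetical mean-pool plus linear layer `W` standing in for the LLM, the data are random vectors, and all names (`two_step_adapt`, `sgd_step`, `toy_data`) are invented for this example. It also assumes the soft prompt stays frozen during step 2, which the summary implies but does not state explicitly. Only the structure carries over: a prompt $S$ with $d$ rows of width $e$ is prepended to the text embeddings, trained alone first, and then held fixed while the decoder is updated.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def forward(S, W, text_emb):
    # Prepend the d x e soft prompt to the T x e text embeddings; it stands in
    # for the audio embeddings that text-only data does not have.
    seq = S + text_emb
    n, e = len(seq), len(W)
    # Toy stand-in for the LLM decoder (assumption): mean-pool, then linear.
    h = [sum(row[j] for row in seq) / n for j in range(e)]
    logits = [sum(h[j] * W[j][v] for j in range(e)) for v in range(len(W[0]))]
    return h, softmax(logits)


def sgd_step(S, W, text_emb, target, lr, train_prompt, train_decoder):
    """One cross-entropy SGD step; the flags select which parameters move."""
    h, p = forward(S, W, text_emb)
    dlogits = [p[v] - (1.0 if v == target else 0.0) for v in range(len(p))]
    # Gradient w.r.t. the pooled vector, computed before W is modified.
    dh = [sum(W[j][v] * dlogits[v] for v in range(len(p))) for j in range(len(h))]
    if train_decoder:
        for j in range(len(h)):
            for v in range(len(p)):
                W[j][v] -= lr * h[j] * dlogits[v]
    if train_prompt:
        n = len(S) + len(text_emb)
        for row in S:
            for j in range(len(h)):
                row[j] -= lr * dh[j] / n  # each prompt row contributes 1/n to h
    return -math.log(p[target])


def two_step_adapt(data, d, e, vocab, epochs=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    S = [[rng.gauss(0, 0.1) for _ in range(e)] for _ in range(d)]
    W = [[rng.gauss(0, 0.1) for _ in range(vocab)] for _ in range(e)]
    snap = {"S_init": [r[:] for r in S], "W_init": [r[:] for r in W]}

    # Step 1: train only the soft prompt; the decoder is frozen.
    for _ in range(epochs):
        for text_emb, target in data:
            sgd_step(S, W, text_emb, target, lr, True, False)
    snap["S_step1"] = [r[:] for r in S]
    snap["W_step1"] = [r[:] for r in W]

    # Step 2: fine-tune the decoder with the trained prompt held fixed (assumption).
    for _ in range(epochs):
        for text_emb, target in data:
            sgd_step(S, W, text_emb, target, lr, False, True)
    snap["S_final"], snap["W_final"] = S, W
    return snap


# Hypothetical "domain text": 6 utterances of 3 token embeddings each, one target id.
rng = random.Random(1)
toy_data = [([[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)], rng.randrange(5))
            for _ in range(6)]
result = two_step_adapt(toy_data, d=4, e=8, vocab=5)
```

Here `d=4` is an arbitrary choice; the summary above notes that the prompt length should be matched to the utterance characteristics of the target domain, so in practice it would be treated as a tunable setting.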