Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, Kai Yu

TL;DR

This work tackles domain adaptation for Speech LLMs in low-resource scenarios by proposing a text-only fine-tuning strategy that leverages target-domain text without additional audio. A LoRA-based adaptation of the LLM decoder is used, with a real-time alignment evaluation that preserves speech-text cross-modal alignment during text updates. Experiments across LibriSpeech, SlideSpeech, Medical, and GigaSpeech show that text-only fine-tuning delivers strong cross-domain generalization and preserves source-domain performance, albeit with some trade-offs in target-domain WER compared to full speech fine-tuning. The approach reduces reliance on costly speech data while remaining scalable, with future work exploring hybrid approaches that combine text and limited speech supervision for further gains.

Abstract

Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.

Paper Structure

This paper contains 39 sections, 16 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: An overview of our two-stage training framework. Left: Source domain pretraining aims to achieve cross-modal alignment between speech and text, following mainstream training strategies. Right: Target domain adaptation seeks to maintain alignment while improving performance on the target domain. During training, the LLM is fine-tuned with LoRA using text-only data, while real-time evaluation of alignment is conducted using text-audio paired data.
  • Figure 2: Real-time perplexity (PPL) and accuracy (Acc) evaluation of speech alignment capabilities during text fine-tuning on the GigaSpeech dataset.
  • Figure 3: An overview of our two-stage training framework. Left: Source domain pretraining aims to achieve cross-modal alignment between speech and text, following mainstream training strategies. Right: Target domain adaptation seeks to maintain alignment while improving performance on the target domain. During training, the LLM is fine-tuned with LoRA using text-only data, while real-time evaluation of alignment is conducted using text-audio paired data.
  • Figure 4: Real-time perplexity (PPL) and accuracy (Acc) evaluation of speech alignment capabilities during text fine-tuning on the GigaSpeech dataset.
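
The figure captions above name two alignment metrics, perplexity (PPL) and accuracy (Acc), evaluated on paired speech-text data while the LLM is fine-tuned on text only. As a rough illustration of how such metrics could be computed, the sketch below assumes the decoder's per-position vocabulary logits and the reference token ids are already available. The function name and input layout are hypothetical and are not taken from the paper, which may compute these quantities differently.

```python
import math
from typing import Sequence, Tuple


def perplexity_and_accuracy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
) -> Tuple[float, float]:
    """Return (perplexity, token accuracy) for one batch of positions.

    logits:  one row of vocabulary scores per target position (assumed layout).
    targets: reference token id for each position.
    """
    if len(logits) != len(targets) or not targets:
        raise ValueError("logits and targets must be non-empty and equal in length")

    total_nll = 0.0  # summed negative log-likelihood of the reference tokens
    correct = 0      # positions where the top-scoring token is the reference

    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp gives the log normalizer of the softmax.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total_nll += log_z - row[target]

        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == target:
            correct += 1

    n = len(targets)
    return math.exp(total_nll / n), correct / n


if __name__ == "__main__":
    # Toy example: two positions over a two-token vocabulary.
    ppl, acc = perplexity_and_accuracy([[10.0, 0.0], [0.0, 10.0]], [0, 1])
    print(f"PPL={ppl:.4f}  Acc={acc:.2f}")
```

Under this sketch, a rising PPL or falling Acc on the paired evaluation set during text-only updates would suggest that speech-text alignment is drifting, which is the kind of signal the real-time evaluation is described as monitoring.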