VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg

TL;DR

VoiceTextBlender introduces a single-stage joint speech-text SFT framework with LoRA to augment LLMs with robust speech capabilities while preserving text-only performance. The 3B VTBlender uses a Canary-based speech encoder, a Conformer modality adapter, and a Gemma LM with LoRA, trained on mixed data including multilingual ASR/AST, speech-based QA derived from ASR data, and TTS-generated mixed-modal SFT data. It achieves state-of-the-art results on several speech benchmarks relative to larger SpeechLMs and maintains competitive text-only performance, with emergent abilities for unseen prompts and multi-turn mixed-modal conversations. Ablation studies demonstrate the necessity of joint speech-text SFT over speech-only or frozen-LM approaches. The work emphasizes practical data generation strategies and publicly releases models and code to accelerate SpeechLM research.

Abstract

Recent studies have augmented large language models (LLMs) with speech capabilities, leading to the development of speech language models (SpeechLMs). Earlier SpeechLMs focused on single-turn speech-based question answering (QA), where user input comprised a speech context and a text question. More recent studies have extended this to multi-turn conversations, though they often require complex, multi-stage supervised fine-tuning (SFT) with diverse data. Another critical challenge with SpeechLMs is catastrophic forgetting, where models optimized for speech tasks suffer significant degradation in text-only performance. To mitigate these issues, we propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the LLM backbone. Our joint SFT combines text-only SFT data with three types of speech-related data: speech recognition and translation, speech-based QA, and mixed-modal SFT. Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks while preserving the original capabilities on text-only tasks. Furthermore, our model shows emergent abilities of effectively handling previously unseen prompts and tasks, including multi-turn, mixed-modal inputs.
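The single-stage idea in the abstract, drawing text-only SFT data and the three speech-related data types into the same training run rather than in successive stages, can be sketched as a weighted sampler. This is a minimal illustration only: the source names, the equal weights, and the `sample_batch_sources` helper are assumptions made for the example, since the abstract does not state the mixing proportions or the actual data loader.

```python
import random
from collections import Counter

# Hypothetical sampling weights. The abstract names these four data types
# but does not give their proportions, so equal weights are assumed here.
MIX_WEIGHTS = {
    "text_sft": 0.25,         # text-only SFT data
    "asr_ast": 0.25,          # speech recognition and translation
    "speech_qa": 0.25,        # speech-based QA
    "mixed_modal_sft": 0.25,  # mixed-modal SFT
}

def sample_batch_sources(batch_size, weights, rng):
    """Pick a data source for every example in a single training batch."""
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=batch_size)

rng = random.Random(0)
batch = sample_batch_sources(8, MIX_WEIGHTS, rng)
counts = Counter(batch)
print(counts)
```

Because every batch can contain examples from all four sources, the text-only data is seen throughout training, which is the intuition behind using joint SFT to limit catastrophic forgetting.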

Paper Structure

This paper contains 23 sections, 3 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Our VTBlender 3B with joint SFT enables multi-turn, mixed-modal conversations, allowing user input in the form of pure speech, pure text, or a combination of both. Note that our speech-related SFT data consists only of single-turn interactions, while our text SFT data has multiple turns.
  • Figure 2: Model architecture. Only one speech-text pair is depicted for simplicity, but the input can contain multiple segments of speech and text in any order.
  • Figure 3: Different types of SFT data are generated for training.
  • Figure 4: Generalization to unseen instructions.
  • Figure 5: Output style and format can be controlled.
  • ...and 5 more figures
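
The Figure 2 caption notes that the input can contain multiple speech and text segments in any order. One way to picture this is as a single embedding sequence built by concatenating per-segment embeddings in their original order. The sketch below is a toy illustration under stated assumptions: `embed_text` and `embed_speech` are hypothetical stand-ins for the tokenizer plus LM embedding table and for the speech encoder plus modality adapter, and `EMBED_DIM` and `hop` are arbitrary small values, not the paper's settings.

```python
from typing import List, Tuple, Union

EMBED_DIM = 4  # hypothetical LM embedding width, kept tiny for illustration

def embed_text(text: str) -> List[List[float]]:
    """Stand-in for tokenizer + LM embedding table: one vector per whitespace token."""
    return [[float(len(tok))] * EMBED_DIM for tok in text.split()]

def embed_speech(samples: List[float], hop: int = 4) -> List[List[float]]:
    """Stand-in for speech encoder + modality adapter: one vector per `hop` samples."""
    frames = [samples[i:i + hop] for i in range(0, len(samples), hop)]
    return [[sum(frame) / len(frame)] * EMBED_DIM for frame in frames]

Segment = Tuple[str, Union[str, List[float]]]

def build_lm_input(segments: List[Segment]) -> List[List[float]]:
    """Concatenate speech and text embeddings in the order the segments appear."""
    sequence: List[List[float]] = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(embed_text(payload))
        elif kind == "speech":
            sequence.extend(embed_speech(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence

# Speech, then text, then speech again: any ordering is handled the same way.
segments = [
    ("speech", [0.1] * 8),
    ("text", "What is said here?"),
    ("speech", [0.2] * 4),
]
lm_input = build_lm_input(segments)
print(len(lm_input), "embedding vectors of width", EMBED_DIM)
```

The point of the sketch is only that both modalities end up in one shared embedding space and one ordered sequence, so the language model does not need a fixed speech-then-text layout.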