Soundwave: Less is More for Speech-Text Alignment in LLMs

Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li

TL;DR

Soundwave tackles data inefficiency in speech–text LLM alignment by splitting the problem into two parts, representation alignment and sequence-length reduction, and addressing them in a three-stage training framework. It introduces two adapters (alignment and shrinking) and combines a frozen Whisper encoder with LoRA to bridge speech and text efficiently. With roughly 10k hours of data, far less than prior systems use, it achieves state-of-the-art AIR-Bench performance and competitive zero-shot translation. The method combines an auxiliary CTC loss, high-quality alignment data, dynamic data mixing, and instruction-focused fine-tuning to preserve conversational capabilities and knowledge-based QA. This lowers training costs and data requirements for speech-capable LLMs, enabling broader accessibility and faster iteration. Remaining limitations in ASR competitiveness and multilingual coverage point to future scaling and data expansion.

Abstract

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.

Paper Structure

This paper contains 67 sections, 2 equations, 18 figures, and 18 tables.

Figures (18)

  • Figure 1: AIR-Bench speech foundation tasks.
  • Figure 2: Training progress of Soundwave. The gray modules are frozen while the orange modules are updated.
  • Figure 3: We first select the features based on the peak of CTC prediction. Then, we use these features to query and gather auxiliary information from the original sequence. Finally, we fuse the two features to achieve shrinking.
  • Figure 4: Adding thought processes to address complicated problems and speech instructions.
  • Figure 5: Training curves of different strategies.
  • ...and 13 more figures
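
The shrinking procedure summarized in the Figure 3 caption (select features at CTC peaks, use them to query the original sequence, then fuse the two) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The function names `ctc_peak_indices` and `shrink` are hypothetical, the peak rule (the most confident frame in each run of identical non-blank greedy labels) is an assumption, the query step is approximated with plain scaled dot-product attention without learned projections, and the fusion is assumed to be a simple residual sum.

```python
import math


def ctc_peak_indices(probs, blank=0):
    """Pick one frame per run of identical non-blank greedy CTC labels.

    Assumption: the "peak" is the frame with the highest posterior for the
    run's label. `probs` is a list of per-frame probability lists.
    """
    peaks = []
    run_label, best_t, best_p = blank, -1, -1.0
    for t, frame in enumerate(probs):
        label = max(range(len(frame)), key=frame.__getitem__)
        if label != run_label:
            # A run just ended; keep its best frame if it was non-blank.
            if run_label != blank:
                peaks.append(best_t)
            run_label, best_p = label, -1.0
        if label != blank and frame[label] > best_p:
            best_t, best_p = t, frame[label]
    if run_label != blank:
        peaks.append(best_t)
    return peaks


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shrink(features, probs, blank=0):
    """Shrink a feature sequence to one vector per CTC peak.

    Each selected feature acts as a query over the full sequence (scaled
    dot-product attention, no learned weights), and the attended context is
    added back to the selected feature (assumed fusion rule).
    """
    peaks = ctc_peak_indices(probs, blank)
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    shrunk = []
    for t in peaks:
        query = features[t]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        context = [
            sum(w * frame[d] for w, frame in zip(weights, features))
            for d in range(dim)
        ]
        shrunk.append([q + c for q, c in zip(query, context)])
    return peaks, shrunk


# Toy input: 5 frames, vocabulary {0: blank, 1: "a", 2: "b"}, 2-dim features.
probs = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.80, 0.10],
    [0.80, 0.10, 0.10],
    [0.10, 0.20, 0.70],
]
features = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]

peaks, shrunk = shrink(features, probs)
print(peaks)        # [2, 4]
print(len(shrunk))  # 2
```

In this toy example the five input frames are reduced to two vectors, one per predicted token, which is the kind of sequence-length reduction the caption describes. A trained model would use learned projections and its own fusion layer in place of the fixed operations assumed here.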