DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

TL;DR

This paper challenges the notion that speech instruction-tuning data is necessary to build instruction-following speech language models. It introduces DeSTA2, a data-construction framework that uses seed transcripts and a single LLM prompt to generate descriptive speech captions from rich speech metadata, training a frozen Whisper-Llama3 pipeline with a lightweight modality adaptor. The approach yields competitive results on Dynamic-SUPERB and AIR-Bench-Chat without task-specific instruction data and preserves the LLM’s reasoning abilities, while reducing annotation effort and forgetting risk. The work suggests a data-efficient path toward universal speech language models and provides a practical alternative to multi-stage instruction-tuning pipelines.

Abstract

Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities. In this work, we present a simple yet effective automatic process for creating speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data, achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our model exhibits the ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-of-thought reasoning. Our approach not only enhances the versatility and effectiveness of SLMs but also reduces reliance on extensive annotated datasets, paving the way for more efficient and capable speech understanding systems.

Paper Structure (15 sections, 1 figure, 5 tables)

Figures (1)

  • Figure 1: (Left) Dataset construction: we feed the seed transcript and a prompt to generate a response, which serves as the training target. (Right) Model training: the end-to-end model learns to generate the same response from speech features and the text transcription.
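
The two panels of Figure 1 can be read as a small data-construction routine. The following is a minimal sketch under stated assumptions, not the authors' implementation: the seed-transcript format, the field names, the prompt wording, and the `echo_llm` stand-in are all hypothetical, and a real pipeline would call an actual text LLM in place of the stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TrainingExample:
    audio_path: str   # speech input the end-to-end model will see
    transcript: str   # text transcription, also given to the model
    prompt: str       # prompt paired with the seed transcript
    target: str       # text-LLM response, used as the training target


def format_seed_transcript(text: str, metadata: Dict[str, str]) -> str:
    """Render transcript plus speech metadata as plain text (hypothetical format)."""
    attrs = ", ".join(f"{key}: {value}" for key, value in sorted(metadata.items()))
    return f'"{text}" ({attrs})'


def build_example(
    audio_path: str,
    text: str,
    metadata: Dict[str, str],
    prompt: str,
    llm: Callable[[str], str],
) -> TrainingExample:
    """Left panel: seed transcript + prompt -> LLM response -> training target."""
    seed = format_seed_transcript(text, metadata)
    target = llm(f"{seed}\n{prompt}")
    return TrainingExample(audio_path, text, prompt, target)


def echo_llm(llm_input: str) -> str:
    """Stand-in for a real text LLM (assumption): echoes the seed transcript line."""
    return "DESCRIPTION OF: " + llm_input.splitlines()[0]


example = build_example(
    audio_path="a.wav",                      # hypothetical file name
    text="hello",
    metadata={"gender": "female", "emotion": "happy"},
    prompt="Describe what you hear.",        # placeholder prompt wording
    llm=echo_llm,
)
print(example.target)
```

In the right panel, the speech model would then be trained to produce `example.target` given the audio at `example.audio_path`, the transcription, and the same prompt, so the training target never has to be written by a human annotator.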