Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving

Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu

TL;DR

This work tackles the limited generalization of speech large language models by introducing MTBI, a multi-task behavior imitation framework that trains a speech LLM to imitate a text LLM’s responses using only paired speech and transcripts. It couples a frozen speech encoder with a trainable connector and a frozen LLM to align speech with the LLM’s textual space, and augments training with speech-text interleaving to improve cross-modal alignment. The authors introduce a benchmark that assesses prompt and task generalization and report that MTBI matches or surpasses state-of-the-art SLLMs while using far less supervised speech data. Constructed content tasks contribute to this result, and an ablation study supports the design. The approach shows strong potential for robust, generalizable SLLMs and lays groundwork for incorporating nonlinguistic speech features in future work.

Abstract

Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these issues, we propose a novel multi-task 'behavior imitation' method with speech-text interleaving, called MTBI, which relies solely on paired speech and transcripts. By ensuring the LLM decoder generates equivalent responses to paired speech and text, we achieve a more generalized SLLM. Interleaving is used to further enhance alignment efficiency. We introduce a simple benchmark to evaluate prompt and task generalization across different models. Experimental results demonstrate that our MTBI outperforms SOTA SLLMs on both prompt and task generalization, while requiring less supervised speech data.
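The abstract does not spell out how interleaving is performed, so the following is only a minimal sketch of one plausible reading: some word-level speech spans are swapped for their transcript words, producing a mixed speech-text input. The function name `interleave`, the `text_ratio` parameter, the word-level alignment, and the placeholder speech spans are all assumptions made for illustration, not details taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

# A mixed-input item: ("speech", span) or ("text", word).
Segment = Tuple[str, object]


def interleave(
    words: Sequence[str],
    speech_spans: Sequence[object],
    text_ratio: float,
    rng: random.Random,
) -> List[Segment]:
    """Replace a random subset of word-level speech spans with their text.

    Assumes (hypothetically) that each word is aligned to exactly one span.
    """
    if len(words) != len(speech_spans):
        raise ValueError("words and speech_spans must be aligned one-to-one")
    mixed: List[Segment] = []
    for word, span in zip(words, speech_spans):
        # rng.random() lies in [0, 1), so text_ratio=0 keeps every span as
        # speech and text_ratio=1 turns every span into text.
        if rng.random() < text_ratio:
            mixed.append(("text", word))
        else:
            mixed.append(("speech", span))
    return mixed


words = ["turn", "on", "the", "light"]
# Placeholder stand-ins for per-word speech features (hypothetical).
spans = [f"<speech:{w}>" for w in words]
mixed = interleave(words, spans, text_ratio=0.5, rng=random.Random(0))
```

In a real system the speech spans would be encoder features passed through the connector, and the text items would be embedded by the LLM's own token embeddings, so both modalities land in the same input sequence.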

Paper Structure

This paper contains 19 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overview of the SLLM architecture. The training process is conducted in two stages using the same text LLM. In stage 1, we use the LLM to generate responses based on the task prompts and transcripts of the speech data. Then, we train the SLLM with behavior imitation, using the same task prompt and the corresponding speech (or interleaved speech) to predict the response generated in the first stage. In stage 2, we train only the connector to align the speech features with the textual space.
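
The data flow in the Figure 1 caption can be sketched as follows: a text LLM answers each task prompt from the transcript, and that answer becomes the target the SLLM must reproduce from the paired speech. This is a rough illustration only. The `ImitationExample` fields, the `build_examples` helper, the prompt format, the example prompts, and the `toy_llm` stand-in are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ImitationExample:
    prompt: str     # task prompt shared by the text LLM and the SLLM
    speech_id: str  # reference to the paired audio fed to the SLLM
    target: str     # response the text LLM generated from the transcript


def build_examples(
    pairs: Sequence[Tuple[str, str]],
    prompts: Sequence[str],
    text_llm: Callable[[str], str],
) -> List[ImitationExample]:
    """Create one imitation target per (speech, prompt) combination."""
    examples: List[ImitationExample] = []
    for speech_id, transcript in pairs:
        for prompt in prompts:
            # The text LLM sees the transcript; the SLLM will later see
            # only the speech and must predict the same response.
            target = text_llm(f"{prompt}\n{transcript}")
            examples.append(ImitationExample(prompt, speech_id, target))
    return examples


def toy_llm(text: str) -> str:
    # Stand-in for a real text LLM: upper-cases the transcript line.
    return text.splitlines()[-1].upper()


pairs = [("utt_001.wav", "turn on the light")]
prompts = ["Repeat the sentence.", "What is the speaker asking for?"]
examples = build_examples(pairs, prompts, toy_llm)
```

Because the targets come from the text LLM itself, only paired speech and transcripts are needed; no task-specific speech annotations enter the pipeline.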