Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving

Jingran Xie; Xiang Li; Hui Wang; Yue Yu; Yang Xiang; Xixin Wu; Zhiyong Wu

TL;DR

This work addresses the limited generalization of speech large language models (SLLMs) with MTBI, a multi-task behavior imitation framework that trains a speech LLM to reproduce a text LLM’s responses using only paired speech and transcripts. It couples a frozen speech encoder with a trainable connector and a frozen LLM to align speech with the LLM’s textual space, and interleaves speech and text during training to improve cross-modal alignment. The authors introduce a benchmark that assesses prompt and task generalization and show that MTBI matches or surpasses state-of-the-art SLLMs while using far less supervised speech data, with the gains aided by constructed content tasks and examined in an ablation study. The approach is a promising route to robust, generalizable SLLMs and lays groundwork for incorporating nonlinguistic speech features in future work.

Abstract

Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these issues, we propose a novel multi-task 'behavior imitation' method with speech-text interleaving, called MTBI, which relies solely on paired speech and transcripts. By ensuring the LLM decoder generates equivalent responses to paired speech and text, we achieve a more generalized SLLM. Interleaving is used to further enhance alignment efficiency. We introduce a simple benchmark to evaluate prompt and task generalization across different models. Experimental results demonstrate that our MTBI outperforms SOTA SLLMs on both prompt and task generalization, while requiring less supervised speech data.
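The two ideas in the abstract, imitating the text LLM's response and interleaving speech with text, can be illustrated with a minimal data-construction sketch. This is not the authors' implementation: `SpeechSegment`, `interleave`, `build_imitation_example`, the word-level alignment, the `text_ratio` parameter, and the stub text LLM are all hypothetical names and assumptions introduced here for illustration only.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Union


@dataclass(frozen=True)
class SpeechSegment:
    """Hypothetical stand-in for the audio that covers one transcript word."""
    word: str  # kept only so the example is easy to inspect


Item = Union[str, SpeechSegment]


def interleave(words: Sequence[str],
               segments: Sequence[SpeechSegment],
               text_ratio: float,
               rng: random.Random) -> List[Item]:
    """Mix modalities: each position keeps its speech segment or, with
    probability `text_ratio`, is replaced by the transcript word (assumed scheme)."""
    if len(words) != len(segments):
        raise ValueError("words and segments must be aligned one-to-one")
    return [w if rng.random() < text_ratio else s
            for w, s in zip(words, segments)]


def build_imitation_example(prompt: str,
                            words: Sequence[str],
                            segments: Sequence[SpeechSegment],
                            text_llm: Callable[[str, str], str],
                            text_ratio: float,
                            rng: random.Random) -> Dict[str, object]:
    """Build one training example whose target is the text LLM's own response
    to the transcript, so no task-specific speech annotation is needed."""
    transcript = " ".join(words)
    target = text_llm(prompt, transcript)  # behavior to imitate
    model_input = interleave(words, segments, text_ratio, rng)
    return {"prompt": prompt, "input": model_input, "target": target}


if __name__ == "__main__":
    # Stub standing in for a real text LLM (assumption for this sketch).
    def stub_llm(prompt: str, transcript: str) -> str:
        return f"[{prompt}] {transcript}"

    words = ["turn", "on", "the", "light"]
    segments = [SpeechSegment(w) for w in words]
    example = build_imitation_example(
        "Summarize the request", words, segments, stub_llm,
        text_ratio=0.5, rng=random.Random(0))
    print(example["input"])
    print(example["target"])
```

In this sketch the same target is used whether the input is pure speech (`text_ratio=0.0`), pure text (`text_ratio=1.0`), or a mixture, which reflects the stated goal that the LLM decoder produce equivalent responses to paired speech and text.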
