Self-Powered LLM Modality Expansion for Large Speech-Text Models

Tengfei Yu, Xuebo Liu, Zhiyi Hou, Liang Ding, Dacheng Tao, Min Zhang

TL;DR

This work tackles the challenge of extending LLMs with speech capabilities. It identifies speech anchor bias, in which models over-attend to speech inputs and neglect textual instructions. To counter it, the authors propose a self-powered augmentation approach that generates instruction-driven training data from the model itself, keeps the speech encoder frozen, and fine-tunes the Q-Former and LLM to follow instructions. Across automatic speech recognition (ASR), speech translation (ST), spoken language understanding (SLU), and question answering (QA) tasks, the self-powered LSM reduces this bias, improves speech-text fusion, and maintains strong textual performance, indicating robust generalization. The method offers a practical path to scalable, end-to-end multimodal models, and the augmentation dataset is publicly released for community use.

Abstract

Large language models (LLMs) exhibit remarkable performance across diverse tasks, indicating their potential for expansion into large speech-text models (LSMs) by integrating speech capabilities. Although unified speech-text pre-training and multimodal data instruction-tuning offer considerable benefits, these methods generally entail significant resource demands and tend to overfit specific tasks. This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning. We explore the instruction-following dynamics within LSMs, identifying a critical issue termed speech anchor bias: a tendency for LSMs to over-rely on speech inputs, mistakenly interpreting the entire speech modality as directives, thereby neglecting textual instructions. To counteract this bias, we introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning. Our experiments across a range of speech-based tasks demonstrate that self-powered LSM mitigates speech anchor bias and improves the fusion of speech and text modalities in LSMs. Data, code, and scripts are freely available at https://github.com/ytf-philp/Self-powered-LSM.
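
The self-powered augmentation idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the names `AsrExample`, `AugmentedExample`, `build_self_powered_data`, and `generate_text`, as well as the prompt template and the instruction string, are hypothetical, and the LLM is replaced by a stub so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List
import random


@dataclass
class AsrExample:
    audio_path: str   # path to the speech clip (hypothetical field name)
    transcript: str   # reference transcript from the vanilla ASR dataset


@dataclass
class AugmentedExample:
    audio_path: str   # speech input for the LSM
    instruction: str  # textual instruction paired with the speech
    target: str       # response generated by the text-only LLM


def build_self_powered_data(
    asr_data: List[AsrExample],
    instructions: List[str],
    generate_text: Callable[[str], str],
    seed: int = 0,
) -> List[AugmentedExample]:
    """Pair each ASR clip with an instruction and an LLM-generated target."""
    rng = random.Random(seed)
    augmented = []
    for example in asr_data:
        instruction = rng.choice(instructions)
        # The LLM only sees text: the instruction plus the transcript.
        # The prompt template is an assumption made for this sketch.
        prompt = f"{instruction}\n\nInput: {example.transcript}"
        target = generate_text(prompt)
        # For training, the transcript is replaced by the speech clip itself.
        augmented.append(AugmentedExample(example.audio_path, instruction, target))
    return augmented


def stub_llm(prompt: str) -> str:
    """Stand-in for the LLM backbone: returns the transcript in upper case."""
    return prompt.split("Input: ", 1)[1].upper()


asr_data = [AsrExample("clip_0001.wav", "hello world")]
instructions = ["Repeat the input in capital letters."]
data = build_self_powered_data(asr_data, instructions, stub_llm)
print(data[0])
```

In an actual setup, `generate_text` would presumably wrap the same LLM that serves as the LSM backbone, so that the generated targets match what the model already produces for text input.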

Paper Structure

This paper contains 47 sections, 8 equations, 6 figures, and 12 tables.

Figures (6)

  • Figure 1: Model architecture of LSM.
  • Figure 2: Left: a well-trained LSM should be able to follow instructions. Right: a model fine-tuned directly on speech instruction data fails to acquire the speech modality expansion capability.
  • Figure 3: Comparison of layer-wise behavior in an instruction-following LLM versus an instruction-ignoring LSM. "Source" refers to text input for the LLM, whereas for the LSM it denotes speech input. As the layers deepen, the proportion of instructions diminishes in the LSM while increasing in the LLM. The red borders show that the LSM focuses excessively on speech representations and ignores instructions.
  • Figure 4: Process of self-powered data augmentation. Self-powered data is generated by prompting the LLM with instructions alongside the text from the vanilla ASR dataset, and is then used to train the LSM.
  • Figure 5: Layer-wise behaviors in self-powered LSM.
  • ...and 1 more figures
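
One way to make the layer-wise comparison in Figures 3 and 5 concrete is to measure, for each layer, what fraction of a per-token importance score falls on instruction tokens rather than source tokens. The sketch below is only an illustration: the score is left generic (attention weights are one possible choice), which may differ from the measure used in the paper, and the function name `instruction_share`, the data layout, and the toy numbers are all invented for the example rather than taken from the paper's results.

```python
from typing import List, Sequence


def instruction_share(
    layer_scores: Sequence[Sequence[float]],
    instruction_positions: Sequence[int],
) -> List[float]:
    """Return, for each layer, the fraction of total score on instruction tokens."""
    positions = set(instruction_positions)
    shares = []
    for scores in layer_scores:
        total = sum(scores)
        on_instruction = sum(s for i, s in enumerate(scores) if i in positions)
        # Guard against an all-zero layer to avoid division by zero.
        shares.append(on_instruction / total if total > 0 else 0.0)
    return shares


# Toy input: 2 layers, 4 tokens; tokens 0-1 are the source, tokens 2-3 the instruction.
toy_scores = [
    [0.25, 0.25, 0.25, 0.25],   # shallow layer: score spread evenly
    [0.5, 0.25, 0.125, 0.125],  # deeper layer: score concentrated on the source
]
shares = instruction_share(toy_scores, instruction_positions=[2, 3])
print(shares)
```

A share that shrinks with depth, as in this toy input, would correspond to the instruction-ignoring pattern the Figure 3 caption describes for the LSM; a share that grows with depth would correspond to the instruction-following LLM.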