Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models

Dehua Tao, Xuan Luo, Daxin Tan, Kai Chen, Lanqing Hong, Jing Li, Ruifeng Xu, Xiao Chen

TL;DR

With only thousands of hours of speech training data, Speech-Omni-Lite achieves spoken QA performance comparable to omni-models trained on millions of hours of speech data.

Abstract

While large-scale omni-models have demonstrated impressive capabilities across modalities, their strong performance relies heavily on massive multimodal data and incurs substantial computational cost. This work introduces Speech-Omni-Lite, a cost-efficient framework that extends pre-trained Vision-Language (VL) backbones with speech understanding and generation capabilities while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is equipped with two lightweight, trainable plug-and-play modules, a speech projector and a speech token generator, while the backbone itself is kept fully frozen. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy generates Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, which facilitates effective speech generation training. Experimental results show that, even with only thousands of hours of speech training data, Speech-Omni-Lite achieves spoken QA performance comparable to omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules exhibit strong transferability across VL backbones.
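
The abstract names the QTATS format but does not spell out the construction recipe. As a rough illustration only, the sketch below assumes that each ASR transcript is reused as the answer text, that the paired audio is reused as the answer speech, and that a question is written for it by some text-only component. The names `AsrPair`, `QtatsExample`, `build_qtats`, and `template_question` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AsrPair:
    """One existing ASR example: an audio reference and its transcript."""
    audio_path: str
    transcript: str


@dataclass(frozen=True)
class QtatsExample:
    """Question text, answer text, and the speech that voices the answer."""
    question_text: str
    answer_text: str
    answer_audio_path: str


def build_qtats(
    pairs: Iterable[AsrPair],
    write_question: Callable[[str], str],
) -> List[QtatsExample]:
    """Turn ASR pairs into QTATS-style records (assumed recipe, not the paper's)."""
    examples: List[QtatsExample] = []
    for pair in pairs:
        transcript = pair.transcript.strip()
        if not transcript:
            continue  # skip pairs that have no usable transcript
        examples.append(
            QtatsExample(
                question_text=write_question(transcript),
                answer_text=transcript,              # transcript reused as the answer text
                answer_audio_path=pair.audio_path,   # existing audio reused as the answer speech
            )
        )
    return examples


def template_question(transcript: str) -> str:
    """Placeholder question writer; a text LLM could plausibly fill this role."""
    return f"Please say the following sentence aloud: {transcript}"


demo = build_qtats(
    [AsrPair("clip_0001.wav", "the cat sat on the mat"), AsrPair("clip_0002.wav", "  ")],
    template_question,
)
```

Under these assumptions no new speech has to be recorded or synthesized, which is one way the "low-cost" claim could be realized; the actual procedure may differ.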

Paper Structure (27 sections, 5 figures, 6 tables)

This paper contains 27 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of training cost and question answering accuracy across different models. Speech hours are calculated by aggregating all speech data used to endow each model with speech understanding and generation capabilities. Speech-Omni-Lite achieves competitive performance with approximately one-tenth of the training cost.
  • Figure 2: The architecture of Speech-Omni-Lite. It comprises a pre-trained discrete speech tokenizer, a trainable speech projector, a pre-trained large VL model, a trainable speech token generator, and a pre-trained speech de-tokenizer.
  • Figure 3: The training strategy of the speech token generator. In the first stage, the text projector is trained with the VL backbone frozen. In the second stage, the speech token generator is trained with the other modules frozen. Both stages leverage the QTATS data.
  • Figure 4: Architecture of the streaming discrete speech tokenizer.
  • Figure 5: Architecture of the CA-DiT block of the speech de-tokenizer.
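
The two-stage schedule in the Figure 3 caption can be read as a per-stage freeze plan over the modules named in Figures 2 and 3. The sketch below is a minimal illustration under that reading. The module identifiers, `STAGE_TRAINABLE`, and `trainable_flags` are hypothetical names, and treating every module other than the text projector as frozen in stage 1 is an assumption, since the caption only states that the VL backbone is frozen there.

```python
from typing import Dict, Tuple

# Modules named in the Figure 2 and Figure 3 captions (identifiers are illustrative).
MODULES: Tuple[str, ...] = (
    "speech_tokenizer",
    "speech_projector",
    "vl_backbone",
    "text_projector",
    "speech_token_generator",
    "speech_detokenizer",
)

# Which single module is updated in each stage of speech token generator training.
STAGE_TRAINABLE: Dict[int, str] = {
    1: "text_projector",          # stage 1: text projector trained, VL backbone frozen
    2: "speech_token_generator",  # stage 2: generator trained, other modules frozen
}


def trainable_flags(stage: int) -> Dict[str, bool]:
    """Return a module -> trainable flag mapping for the given stage."""
    if stage not in STAGE_TRAINABLE:
        raise ValueError(f"unknown stage: {stage}")
    target = STAGE_TRAINABLE[stage]
    return {name: name == target for name in MODULES}


stage1 = trainable_flags(1)
stage2 = trainable_flags(2)
```

In a framework such as PyTorch, flags like these would typically be applied by setting `requires_grad` on each module's parameters. The plan covers only the generator training described in Figure 3, not how the speech projector is trained.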