COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning
Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li
TL;DR
COSMIC is a data-efficient approach to giving a text-based LLM speech in-context learning abilities. It uses GPT-3.5 to generate speech comprehension question-answer (SQA) data and trains a compact multi-modal model with fewer than 30M trainable parameters on 450 hours of English speech. The architecture combines a pre-trained acoustic encoder, a windowed QFormer, and a frozen LLM augmented with LoRA adapters, enabling zero-shot and few-shot English-to-X (EN→X) speech-to-text translation (S2TT) as well as cross-domain adaptation. Key contributions include the data-efficient instruction-tuning framework, emergent speech in-context capabilities, and demonstrated gains in ASR, S2TT, and contextual biasing across in-domain and cross-domain tasks. The results suggest a practical, cost-effective way to integrate the speech modality into LLMs for cross-lingual and context-aware speech understanding and translation.
Abstract
We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a multi-modal LLM that we call COSMIC (Contextual Speech Model with Instruction-following/in-context-learning Capabilities). Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters and only 450 hours of English speech data, COSMIC demonstrates emerging capabilities in instruction-following and in-context learning. Equipped with these capabilities, COSMIC achieves a BLEU score of up to 33.18 in 0-shot EN-to-X speech-to-text translation (S2TT) and a significant boost in the 1-shot setting. Additionally, it achieves an average 25.8% relative Word Error Rate (WER) reduction for 1-shot cross-domain adaptation. COSMIC also exhibits a significant automatic speech recognition (ASR) accuracy gain in contextual biasing tasks, owing to its instruction-following capability.
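
For readers unfamiliar with the metric, a relative WER reduction is usually computed as the drop in WER divided by the baseline WER. The sketch below illustrates that arithmetic only; the function name and the input values are hypothetical and are not results from this work.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction, in percent.

    Both inputs are WERs in the same unit (for example, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper):
# a 0-shot WER of 20.0 that falls to 15.0 with one in-context example.
example = relative_wer_reduction(baseline_wer=20.0, adapted_wer=15.0)
print(f"relative WER reduction: {example:.1f}%")
```

Under this definition, an average figure such as the one quoted above would be the mean of the per-domain relative reductions, assuming that is how the averaging is done.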
