SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, Arunasish Sen, Jian Xie, Grant P. Strimel, Andreas Schwarz
TL;DR
The paper presents SIFT-50M, a large-scale multilingual dataset (50M examples) for instruction fine-tuning and pre-training of speech-text LLMs, addressing the lack of diverse, multilingual instruction data for speech tasks. It describes a complete data pipeline (metadata extraction, multilingual instruction generation covering closed-ended, open-ended, and controllable-generation instructions, and rigorous quality assurance) and develops SIFT-LLM, a speech-text LLM trained via continued pre-training and instruction fine-tuning on SIFT-50M. A new benchmark, EvalSIFT, is introduced to systematically evaluate instruction following and controllable generation across languages. Empirical results show that SIFT-LLM achieves strong instruction-following performance and competitive results on foundational speech tasks, and that it supports controllable speech generation, while ablations reveal important trade-offs between data volume, pre-training, and task mix. The work provides scalable resources and benchmarks to advance multilingual speech instruction tuning and highlights directions for balancing instruction following with robust speech understanding.
Abstract
We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.
