SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, Arunasish Sen, Jian Xie, Grant P. Strimel, Andreas Schwarz

TL;DR

The paper presents SIFT-50M, a large-scale multilingual dataset (50M examples) for instruction fine-tuning and pre-training of speech-text LLMs, addressing the lack of diverse, multilingual instruction data for speech tasks. It describes a complete data pipeline (metadata extraction; multilingual instruction generation covering closed-ended, open-ended, and controllable-generation instructions; and quality assurance) and develops SIFT-LLM, a speech-text LLM trained via continued pre-training and instruction fine-tuning on SIFT-50M. A new benchmark, EvalSIFT, is introduced to systematically evaluate instruction following and controllable generation across languages. Empirical results show that SIFT-LLM achieves strong instruction-following performance and competitive results on foundational speech tasks, and that it can perform controllable speech generation, while ablations reveal trade-offs among data volume, pre-training, and task mix. The work provides scalable resources and benchmarks to advance multilingual speech instruction tuning and highlights directions for balancing instruction following with robust speech understanding.

Abstract

We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.

Paper Structure

This paper contains 40 sections, 4 figures, 18 tables.

Figures (4)

  • Figure 1: Block diagram showing the stages of SIFT-50M dataset construction. For non-English data generation, we substitute the metadata mapping with the respective language and prompt the LLM to generate responses in that language.
  • Figure 2: Effect of SIFT data volume used during instruction fine-tuning on SIFT-LLM's performance, as measured on DS-1, AIR-Bench Chat, and EvalSIFT.
  • Figure 3: Dataset distribution showing the multi-lingual nature of SIFT-50M and the different categories within each language.
  • Figure 4: Distribution of the number of examples per language in the acoustic-based language ID (LID) task that is part of the closed-ended instructions.