Audio-FLAN: A Preliminary Release

Liumeng Xue; Ziya Zhou; Jiahao Pan; Zixuan Li; Shuai Fan; Yinghao Ma; Sitong Cheng; Dongchao Yang; Haohan Guo; Yujia Xiao; Xinsheng Wang; Zixuan Shen; Chuanbo Zhu; Xinshen Zhang; Tianchi Liu; Ruibin Yuan; Zeyue Tian; Haohe Liu; Emmanouil Benetos; Ge Zhang; Yike Guo; Wei Xue

TL;DR

Audio-FLAN addresses the lack of instruction-tuning data for unified, broad-capability audio-language models by introducing a large-scale corpus across speech, music, and sound. The approach combines data collection from 52 public datasets, a standardized JSONL task format, and an instruction variation pipeline to produce diverse instruction-input-output pairs for both understanding and generation tasks. The corpus spans 23 major tasks, 80 minor tasks, and 108.5M instances, and is designed to enable zero-shot generalization across audio domains. The work lays the groundwork for unified audio-language models and invites community collaboration to expand tasks, domains, and conversational capabilities.
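To make the idea of a standardized JSONL task format concrete, the following is a minimal sketch of how instruction-input-output records could be written to and read back from JSON Lines, with one JSON object per line. The field names (`instruction`, `input`, `output`, `task`, `domain`), the `<audio>` placeholder, the file names, and the helper functions are assumptions made for illustration; they are not taken from the released dataset, whose actual schema may differ.

```python
import json


def make_record(instruction, input_text, output_text, task, domain):
    """Build one instruction-input-output record.

    The keys used here are hypothetical, chosen only to illustrate the idea.
    """
    return {
        "instruction": instruction,
        "input": input_text,
        "output": output_text,
        "task": task,
        "domain": domain,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# One understanding-style and one generation-style example (both invented).
records = [
    make_record(
        "Transcribe the speech in the audio.",
        "<audio>clip_0001.wav</audio>",   # hypothetical audio reference
        "hello world",
        "speech recognition",
        "speech",
    ),
    make_record(
        "Generate a sound that matches the description.",
        "rain falling on a tin roof",
        "<audio>gen_0001.wav</audio>",    # hypothetical audio reference
        "text-to-audio generation",
        "sound",
    ),
]

jsonl_text = to_jsonl(records)
parsed = from_jsonl(jsonl_text)
```

Keeping understanding and generation examples in the same record shape is what would let a single model consume both kinds of task from one file; only the position of the audio reference (input versus output) changes.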

Abstract

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
