UniSep: Universal Target Audio Separation with Language Models at Scale

Yuanyuan Wang, Hangting Chen, Dongchao Yang, Weiqin Li, Dan Luo, Guangzhi Li, Shan Yang, Zhiyong Wu, Helen Meng, Xixin Wu

TL;DR

UniSep addresses universal target audio separation across unlimited domains by modeling audio as discrete tokens via SoundStream and applying a causal decoder-only language model to perform sequence-to-sequence separation. It introduces two audio-only pre-training tasks, Audio Continuation and Audio Inpaint, to improve the model's grasp of consistency and correlation within audio sequences across domains. The model has 535 million parameters and is trained on about 36.5k hours of data using 8 V100 GPUs. The approach performs strongly across speech, sound, and music tasks, with pre-training providing clear gains and the unified model outperforming single-task baselines. It can also be fine-tuned for downstream tasks such as language-queried audio source separation, which suggests its potential as a foundation model for audio understanding. The work shows that scaling data and adopting LM architectures is practical for universal audio processing, and could simplify cross-domain separation pipelines and enable flexible, description-driven audio extraction.
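To make the two audio-only pre-training tasks concrete, here is a minimal sketch of how training pairs might be built from a sequence of discrete audio tokens. This is an illustration under stated assumptions, not the authors' implementation: the special-token ids `SEP` and `MASK`, the function names, and the choice to predict only the hidden span in the inpainting task are all hypothetical, and a real SoundStream codec emits several codebook layers per frame rather than the single flat token list used here.

```python
from typing import List, Tuple

# Hypothetical special-token ids (assumption: the summary does not describe
# the vocabulary layout used by UniSep).
SEP = -1   # marks the boundary between conditioning tokens and target tokens
MASK = -2  # stands in for each hidden token in the inpainting task


def continuation_example(tokens: List[int], split: int) -> Tuple[List[int], List[int]]:
    """Audio Continuation (sketch): condition on a prefix, predict the rest."""
    if not 0 < split < len(tokens):
        raise ValueError("split must fall strictly inside the sequence")
    return tokens[:split] + [SEP], tokens[split:]


def inpaint_example(tokens: List[int], start: int, length: int) -> Tuple[List[int], List[int]]:
    """Audio Inpaint (sketch): hide a span, predict the hidden tokens.

    Assumption: only the masked span is the target; the paper may instead
    reconstruct the full sequence.
    """
    end = start + length
    if not 0 <= start < end <= len(tokens):
        raise ValueError("masked span must lie inside the sequence")
    corrupted = tokens[:start] + [MASK] * length + tokens[end:]
    return corrupted + [SEP], tokens[start:end]


if __name__ == "__main__":
    toks = [10, 11, 12, 13, 14, 15]
    print(continuation_example(toks, 2))
    print(inpaint_example(toks, 1, 3))
```

In both cases the conditioning tokens and the target would be concatenated into one sequence for a causal decoder-only model, with the loss presumably computed only on the target positions.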

Abstract

We propose Universal target audio Separation (UniSep), addressing the separation task on arbitrary mixtures of different types of audio. Unlike previous studies, UniSep operates on unlimited source domains and an unlimited number of sources. We formulate the separation task as a sequence-to-sequence problem, and a large language model (LLM) is used to model the audio sequence in the discrete latent space, leveraging the power of LLMs in handling complex audio mixtures with large-scale data. Moreover, a novel pre-training strategy is proposed to utilize audio-only data, which reduces the effort of large-scale data simulation and enhances the ability of LLMs to understand the consistency and correlation of information within audio sequences. We also demonstrate the effectiveness of scaling datasets in an audio separation task: we use large-scale data (36.5k hours), including speech, music, and sound, to train a universal target audio separation model that is not limited to a specific domain. Experiments show that UniSep achieves competitive subjective and objective evaluation results compared with single-task models.
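The sequence-to-sequence formulation can be written as the standard autoregressive factorization used by causal decoder-only language models. The notation is ours and is an assumption rather than the paper's: $x$ denotes the conditioning tokens (the encoded mixture plus any target query) and $y = (y_1, \dots, y_T)$ denotes the discrete tokens of the target audio.

$$
p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta\left(y_t \mid y_{<t},\, x\right),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
$$

Under this reading, separation amounts to generating $y$ token by token and decoding the result back to a waveform with the codec.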

Paper Structure

This paper contains 14 sections, 3 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: (a) illustrates the two pre-training tasks; (b) shows the sequence layout for UniSep. All inputs are encoded into the discrete token space so that language model architectures can be used directly for audio separation.
  • Figure 2: Visualization results on language-queried audio source separation.