OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao

TL;DR

This paper introduces Omni-modal Sound Separation (OmniSep), a novel framework that isolates clean soundtracks based on omni-modal queries, covering both single-modal and multi-modal composed queries. It is trained by blending query features from different modalities.

Abstract

Scaling up has brought tremendous success to vision and language in recent years. In audio, however, researchers face a major challenge in scaling up the training data, because most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds as desired. Finally, OmniSep employs a retrieval-augmented approach known as Query-Aug, which enables open-vocabulary sound separation. Experimental evaluations on the MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate the effectiveness of OmniSep, which achieves state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. For samples and further information, please visit the demo page at https://omnisep.github.io/.
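To make the Query-Mixup idea concrete, the following is a minimal sketch of blending text, image, and audio query features into one training query. The abstract only states that features from different modalities are blended; the random convex weighting, the function name `query_mixup`, and the use of plain Python lists as embeddings are assumptions made for illustration, not the authors' implementation.

```python
import random


def query_mixup(q_text, q_image, q_audio, rng=None):
    """Blend three same-length query embeddings with random convex weights.

    Assumption: weights are drawn uniformly and normalised to sum to 1.
    The source only says that query features are blended during training.
    """
    if not (len(q_text) == len(q_image) == len(q_audio)):
        raise ValueError("all query embeddings must have the same length")
    rng = rng or random.Random()
    # Small offset keeps the normalisation safe if all draws are 0.0.
    raw = [rng.random() + 1e-8 for _ in range(3)]
    total = sum(raw)
    weights = [r / total for r in raw]
    # Element-wise weighted sum of the three modality embeddings.
    mixed = [
        weights[0] * t + weights[1] * v + weights[2] * a
        for t, v, a in zip(q_text, q_image, q_audio)
    ]
    return mixed, weights


# Hypothetical 2-dimensional embeddings, purely for demonstration.
mixed_query, mix_weights = query_mixup(
    [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], rng=random.Random(0)
)
```

Because the blended query lies inside the convex hull of the three modality embeddings, a separator trained on it sees queries that interpolate between modalities, which is one plausible reading of how a single model can serve text-, image-, and audio-queried separation.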

Paper Structure

This paper contains 29 sections, 5 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Illustration of Omni-modal Sound Separation (OmniSep). OmniSep employs the parameter-frozen ImageBind model to extract features from queries of different modalities, denoted as $\mathbf{Q}_T$, $\mathbf{Q}_V$, and $\mathbf{Q}_A$ in the figure. The negative query $N_1$, which aligns semantically with the interference audio $A_2$, is used to aid sound separation during inference. Note that at test time for IQSS, TQSS, and AQSS, only a single-modal query is employed.
  • Figure 2: The variation of SDR with the negative query weight $\alpha$ on VGGSOUND-CLEAN+ and MUSIC-CLEAN+. The x-axis represents the weight $\alpha$ of the negative query, while the y-axis denotes the SDR. The shaded area indicates the standard deviation of the SDR. The dashed and solid lines respectively represent the results of naive subtraction and our proposed method.
  • Figure 3: UMAP visualization of ImageBind embeddings from three different modalities. The mix embedding is a weighted combination of the embeddings of the three modalities.
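Figure 2 compares the proposed negative-query handling against naive subtraction as the weight $\alpha$ varies. The sketch below illustrates only the naive baseline, under the assumption that it means subtracting $\alpha$ times the negative query embedding from the positive one. The renormalisation step and the function names are hypothetical additions for illustration; the proposed method itself is not described in this section and is not reproduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def naive_negative_query(q_pos, q_neg, alpha):
    """Naive subtraction baseline (assumed form): q_pos - alpha * q_neg.

    alpha = 0 ignores the negative query; a larger alpha pushes the
    resulting query further away from the unwanted sound.
    """
    if len(q_pos) != len(q_neg):
        raise ValueError("query embeddings must have the same length")
    shifted = [p - alpha * n for p, n in zip(q_pos, q_neg)]
    # Renormalising is an assumption, keeping the query on the unit sphere.
    return l2_normalize(shifted)


# Hypothetical embeddings: the positive query overlaps with the negative one.
query = naive_negative_query([1.0, 1.0], [0.0, 1.0], alpha=1.0)
```

In this toy case the component shared with the negative query is removed entirely, which also hints at why naive subtraction can be sensitive to $\alpha$: too large a weight distorts the positive query itself.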