OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation

Tanvir Mahmud; Diana Marculescu

TL;DR

OpenSep is a novel framework that leverages large language models for automated audio separation, eliminating the need for manual intervention and overcoming source limitations. It also introduces a multi-level extension of the mix-and-separate training framework to enhance modality alignment.

Abstract

Audio separation in real-world scenarios, where mixtures contain a variable number of sources, presents significant challenges due to limitations of existing models, such as over-separation, under-separation, and dependence on predefined training sources. We propose OpenSep, a novel framework that leverages large language models (LLMs) for automated audio separation, eliminating the need for manual intervention and overcoming source limitations. OpenSep uses textual inversion to generate captions from audio mixtures with off-the-shelf audio captioning models, effectively parsing the sound sources present. It then employs few-shot LLM prompting to extract detailed audio properties of each parsed source, facilitating separation in unseen mixtures. Additionally, we introduce a multi-level extension of the mix-and-separate training framework to enhance modality alignment by separating single source sounds and mixtures simultaneously. Extensive experiments demonstrate OpenSep's superiority in precisely separating new, unseen, and variable sources in challenging mixtures, outperforming SOTA baseline methods. Code is released at https://github.com/tanvir-utexas/OpenSep.git
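
The inference flow described in the abstract (caption the mixture, parse the sources, enrich each source with audio properties, then separate with text conditioning) can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the function `open_sep_pipeline`, the `SeparatedSource` container, the prompt format, and all `toy_*` stand-ins are hypothetical names introduced here. In practice each callable would wrap a real captioning model, an LLM, and a trained separator.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = Sequence[float]  # assumption: a mono waveform as a sequence of samples


@dataclass
class SeparatedSource:
    name: str          # source parsed from the caption
    prompt: str        # enriched text prompt used for conditioning
    audio: List[float]


def open_sep_pipeline(
    mixture: Audio,
    captioner: Callable[[Audio], str],
    source_parser: Callable[[str], List[str]],
    knowledge_parser: Callable[[str], str],
    separator: Callable[[Audio, str], Audio],
) -> List[SeparatedSource]:
    """Hypothetical four-stage flow: caption -> sources -> properties -> separation."""
    caption = captioner(mixture)        # textual inversion of the mixture
    sources = source_parser(caption)    # LLM parses the sources in the caption
    results = []
    for name in sources:
        details = knowledge_parser(name)   # LLM describes the source's audio properties
        prompt = f"{name}: {details}"      # assumed prompt format
        results.append(SeparatedSource(name, prompt, list(separator(mixture, prompt))))
    return results


# Toy stand-ins so the sketch runs end to end; none of these are real models.
def toy_captioner(mixture: Audio) -> str:
    return "a woman talks while a dog barks"


def toy_source_parser(caption: str) -> List[str]:
    return [part.strip() for part in caption.split(" while ")]


def toy_knowledge_parser(name: str) -> str:
    return "placeholder audio properties"


def toy_separator(mixture: Audio, prompt: str) -> Audio:
    return [0.5 * x for x in mixture]


outputs = open_sep_pipeline(
    [1.0, -1.0], toy_captioner, toy_source_parser, toy_knowledge_parser, toy_separator
)
```

Passing the models in as callables keeps the number of separated outputs tied to the number of parsed sources, which is how the abstract frames the handling of a variable number of sources.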

Paper Structure (37 sections, 7 figures, 9 tables)

This paper contains 37 sections, 7 figures, 9 tables.

Introduction
Related Work
Unconditional Sound Separation
Conditional Sound Separation
Methodology
Source Parsing with Textual Inversion
Textual Inversion
Source Parser LLM
Knowledge Parsing for Each Source
Text-Conditioned Audio Separator
Proposed Training Pipeline
Results
Evaluation Setup
Dataset
Implementation Details
...and 22 more sections

Figures (7)

Figure 1: Unconditional audio separators suffer from both over-separation and under-separation in noisy mixtures, and cannot parse audio entities without additional classifiers. Furthermore, conditional separators rely on manual text prompts for source separation, limiting their use in practice. In contrast, OpenSep fully automates the source parsing and separation flow, even with a varying number of unseen and noisy sources in the open world.
Figure 2: Proposed OpenSep pipeline: We first apply textual inversion to noisy audio mixtures with an off-the-shelf audio captioning model to extract text descriptions. A pre-trained instruction-tuned LLM then parses audio sources from the caption and extracts detailed audio properties of each source. Finally, a text-conditional audio separator separates each audio source from the noisy mixture using the enriched text prompts. The audio separator is trained to leverage detailed audio properties in textual representation.
Figure 3: Proposed training pipeline: We extend the baseline mix-and-separate framework with a multi-order separation objective for enhanced modality alignment. We first sample four independent single-source sounds and prepare synthetic mixtures of two and four sources. We parse enriched text prompts for mixtures and single-source sounds with a knowledge parser LLM. The audio separator is trained to separate both single-source sounds and lower-order mixtures based on enriched text guidance using an L1 loss objective.
Figure 4: Qualitative results on natural mixtures from AudioCaps. (Left) All baselines show large spectral overlap of the woman-talking sound in the other two source predictions. OpenSep precisely disentangles all three sources, minimizing spectral overlap across sources while preserving spectral details. (Right) With the dominant noisy sound of children yelling, the baselines can hardly separate the woman-talking sound. OpenSep significantly reduces noise in the woman-talking prediction while preserving spectral details of the noisy children-yelling sound.
Figure 5: Qualitative results on natural mixtures from AudioCaps. (Left) The dominant "woman talks" spectral content is visible in "frying foods" for most baselines. In CLIPSep, this overlap is largely reduced, but horizontal spectral content from "music plays" is visible. In contrast, OpenSep largely reduces such spectral overlap in all three components while preserving all details. (Right) In this mixture, the "beep sound" is present only at the beginning, with a strong noisy "wind blows" sound across the spectrogram. Most baseline methods retain noisy spectral content in the "beep sound" prediction while losing spectral content in the "wind blows" prediction. In contrast, OpenSep disentangles this noisy mixture with a significant reduction of spectral overlap.
...and 2 more figures
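
The training scheme in the Figure 3 caption (four single-source sounds, synthetic mixtures of two and four sources, L1 loss on separated outputs) can be sketched in a few lines. This is only an illustration under stated assumptions: the particular pairing of sources into two-source mixtures, the choice of targets for each mixture, and the names `mix`, `l1_loss`, and `build_training_pairs` are hypothetical. The short string tags stand in for the enriched text prompts that a knowledge parser LLM would produce.

```python
from typing import List, Tuple

Audio = List[float]  # assumption: a mono waveform as a list of samples


def mix(*signals: Audio) -> Audio:
    """Sample-wise sum of equal-length signals (a synthetic mixture)."""
    return [sum(values) for values in zip(*signals)]


def l1_loss(prediction: Audio, target: Audio) -> float:
    """Mean absolute error between a predicted and a target waveform."""
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(target)


def build_training_pairs(
    s1: Audio, s2: Audio, s3: Audio, s4: Audio
) -> List[Tuple[Audio, str, Audio]]:
    """Return (input mixture, text condition, target) triples.

    Assumed layout: the four-source mixture is separated into its two
    lower-order mixtures, and each two-source mixture into its single sources.
    """
    m12 = mix(s1, s2)
    m34 = mix(s3, s4)
    m1234 = mix(m12, m34)
    return [
        (m1234, "prompt for s1+s2", m12),  # higher-order -> lower-order mixture
        (m1234, "prompt for s3+s4", m34),
        (m12, "prompt for s1", s1),        # lower-order mixture -> single source
        (m12, "prompt for s2", s2),
        (m34, "prompt for s3", s3),
        (m34, "prompt for s4", s4),
    ]


pairs = build_training_pairs([1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 1.0])

# A do-nothing separator that returns its input gives a baseline loss per pair.
baseline_losses = [l1_loss(mixture, target) for mixture, _, target in pairs]
```

In an actual training loop the separator's prediction for each (mixture, prompt) pair would replace the identity baseline, and the L1 losses over all pairs would be combined into one objective.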
