MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
TL;DR
MMAudioSep tackles video/text-queried sound separation by fine-tuning a pretrained video-to-audio generation model. The mixture latent is injected through channel-concatenation conditioning on the model's input channels, and training uses a conditional flow matching objective $L_\text{CFM}(\theta)$, with an ODE-based flow carrying noise $x_0$ to the target latent over $t \in [0,1]$. The approach achieves competitive or superior performance compared to baselines such as AudioSep and FlowSep on VGGSound-Clean and MUSIC. It also preserves the model's video-to-audio generation capability after fine-tuning, especially when the pretrained initialization is combined with selective freezing. This work demonstrates that foundational multimodal sound generation models can be repurposed for downstream sound separation, suggesting potential for universal separation and broader multimodal audio applications.
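The training and sampling recipe summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the commonly used linear interpolation path $x_t = (1-t)\,x_0 + t\,x_1$ with target velocity $x_1 - x_0$, a plain Euler solver, and latents shaped `(channels, frames)`. The function names `cfm_training_pair`, `cfm_loss`, and `euler_sample` are hypothetical and chosen only for illustration.

```python
import numpy as np

def cfm_training_pair(x1, mixture, rng):
    """Build one flow-matching training example (illustrative only).

    x1 and mixture are latents of shape (channels, frames).
    Assumption: linear path between noise x0 and target latent x1.
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample x_0
    t = rng.uniform(0.0, 1.0)                 # time t in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # point on the assumed linear path
    target_velocity = x1 - x0                 # velocity of that path
    # Channel-concatenation conditioning: stack the mixture latent
    # onto the noisy latent along the channel axis.
    model_input = np.concatenate([xt, mixture], axis=0)
    return model_input, t, target_velocity

def cfm_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, mixture, rng, steps=25):
    """Integrate the ODE from noise (t=0) to a latent estimate (t=1)."""
    x = rng.standard_normal(mixture.shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # velocity_fn stands in for the fine-tuned network.
        x = x + dt * velocity_fn(np.concatenate([x, mixture], axis=0), t)
    return x
```

In this sketch the network would receive twice as many input channels as the latent itself, which is why channel concatenation typically requires widening only the input layer of a pretrained model. The separated latent returned by `euler_sample` would then be decoded to audio by the model's own decoder, which is outside the scope of this example.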
Abstract
We introduce MMAudioSep, a generative model for video/text-queried sound separation built on a pretrained video-to-audio model. By leveraging the relationship between video/text and audio that the pretrained audio generative model has already learned, we can train the model more efficiently; that is, it does not need to be trained from scratch. We evaluate MMAudioSep against existing separation models, covering both deterministic and generative approaches, and find that it outperforms these baselines. Furthermore, we demonstrate that even after acquiring sound separation functionality via fine-tuning, the model retains its original video-to-audio generation ability. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
