MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

Zihan Zhang; Xize Cheng; Zhennan Jiang; Dongjie Fu; Jingyuan Chen; Zhou Zhao; Tao Jin

TL;DR

MARS-Sep rethinks universal sound separation by framing mask prediction as reinforcement learning guided by multimodal rewards, addressing the mismatch between signal metrics and semantic fidelity. It introduces a factorized Beta policy over time-frequency bins and a clipped trust-region objective, with rewards derived from a progressively aligned ImageBind encoder that fuses audio, text, and visual cues. A three-stage progressive fine-tuning curriculum aligns the multimodal representations, improving reward faithfulness and stability during RL. Empirical results on VGGSOUND-clean+ and MUSIC-clean+ show consistent improvements in both signal metrics (e.g., SDR, SIR, SAR, SI-SDRi) and semantic alignment (CLAP), demonstrating robust, semantically aware separation across text, audio, and image queries. The work offers a practical path toward open-domain, semantically guided separation with strong cross-modal grounding and stable training dynamics.

Abstract

Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To bridge this gap, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate with entropy regularization and group-relative advantage normalization. Concretely, we sample masks from a frozen old policy, reconstruct waveforms, and update the current policy using clipped importance ratios, yielding substantially more stable and sample-efficient learning. Multimodal rewards, derived from an audio-text-vision encoder, directly incentivize semantic consistency with query prompts. We further propose a progressive alignment scheme to fine-tune this encoder, boosting its cross-modal discriminability and improving reward faithfulness. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://anonymous.4open.science/r/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
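
The update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic PPO-style clipping rule, population-standard-deviation normalization for the group-relative advantages, a toy "spectrogram" of four time-frequency bins, and placeholder reward values standing in for the multimodal encoder's scores. The entropy regularizer is omitted for brevity, and all function names and the `clip_eps` default are hypothetical.

```python
import math
import random


def beta_log_prob(m, a, b):
    """Log-density of Beta(a, b) evaluated at a mask value m in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(m) + (b - 1.0) * math.log(1.0 - m)


def mask_log_prob(mask, alphas, betas):
    """Factorized policy: per-bin log-densities add up over all bins."""
    return sum(beta_log_prob(m, a, b) for m, a, b in zip(mask, alphas, betas))


def sample_mask(alphas, betas, rng, eps=1e-6):
    """Draw one mask from the policy, clamped away from 0 and 1 for stable logs."""
    return [min(max(rng.betavariate(a, b), eps), 1.0 - eps)
            for a, b in zip(alphas, betas)]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group of masks sampled for the same mixture."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped importance-ratio objective (to be maximized), averaged over the group."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Toy example: 4 time-frequency bins, a group of 4 masks from the frozen old policy.
rng = random.Random(0)
old_a, old_b = [2.0, 2.0, 5.0, 1.5], [2.0, 5.0, 2.0, 1.5]  # frozen old policy
new_a, new_b = [2.2, 2.0, 5.5, 1.5], [2.0, 5.5, 2.0, 1.5]  # current policy
masks = [sample_mask(old_a, old_b, rng) for _ in range(4)]
rewards = [0.1, 0.4, 0.3, 0.8]  # placeholders for multimodal reward scores
advantages = group_relative_advantages(rewards)
logp_old = [mask_log_prob(m, old_a, old_b) for m in masks]
logp_new = [mask_log_prob(m, new_a, new_b) for m in masks]
objective = clipped_surrogate(logp_new, logp_old, advantages)
```

In a full system, the Beta parameters would be predicted per bin by the separation network, each sampled mask would be applied to the mixture spectrogram to reconstruct a waveform, and the reward would come from scoring that waveform against the query. The sketch only shows how the sampled masks, their rewards, and the two policies combine into a single scalar objective.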
