MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

Zihan Zhang, Xize Cheng, Zhennan Jiang, Dongjie Fu, Jingyuan Chen, Zhou Zhao, Tao Jin

TL;DR

MARS-Sep rethinks universal sound separation by framing mask prediction as reinforcement learning guided by multimodal rewards, addressing the mismatch between signal metrics and semantic fidelity. It introduces a factorized Beta policy over time-frequency bins and a clipped trust-region objective, with rewards derived from a progressively aligned ImageBind encoder that fuses audio, text, and visual cues. A three-stage progressive fine-tuning curriculum aligns the multimodal representations, improving reward faithfulness and stability during RL. Empirical results on VGGSOUND-clean+ and MUSIC-clean+ show consistent improvements in both signal metrics (e.g., SDR, SIR, SAR, SI-SDRi) and semantic alignment (CLAP), demonstrating robust, semantically aware separation across text, audio, and image queries. The work offers a practical path toward open-domain, semantically guided separation with strong cross-modal grounding and stable training dynamics.

Abstract

Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To bridge this gap, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate with entropy regularization and group-relative advantage normalization. Concretely, we sample masks from a frozen old policy, reconstruct waveforms, and update the current policy using clipped importance ratios, yielding substantially more stable and sample-efficient learning. Multimodal rewards, derived from an audio-text-vision encoder, directly incentivize semantic consistency with query prompts. We further propose a progressive alignment scheme to fine-tune this encoder, boosting its cross-modal discriminability and improving reward faithfulness. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://anonymous.4open.science/r/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
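
To make the update described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tiny mask of three time-frequency bins, placeholder reward values, and a clipping range of 0.2; all function names (`beta_log_prob`, `sample_mask`, `group_relative_advantages`, `clipped_surrogate`) are hypothetical. The entropy regularizer is omitted for brevity, and waveform reconstruction is not modeled.

```python
import math
import random

def beta_log_prob(m, a, b):
    """Log-density of Beta(a, b) at a mask value m in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(m) + (b - 1.0) * math.log(1.0 - m) - log_norm

def mask_log_prob(mask, alphas, betas):
    """Factorized policy: sum the per-bin log-densities over all bins."""
    return sum(beta_log_prob(m, a, b) for m, a, b in zip(mask, alphas, betas))

def sample_mask(alphas, betas, rng, eps=1e-6):
    """Draw one mask value per bin, clamped away from exactly 0 and 1."""
    return [min(max(rng.betavariate(a, b), eps), 1.0 - eps)
            for a, b in zip(alphas, betas)]

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group of masks sampled for one mixture."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped importance-ratio objective (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Usage: sample a group of masks from the frozen old policy, then score
# them under the current policy. All numbers below are illustrative.
rng = random.Random(0)
old_a, old_b = [2.0, 5.0, 1.5], [2.0, 1.5, 4.0]  # frozen old policy
new_a, new_b = [2.2, 5.0, 1.5], [2.0, 1.4, 4.0]  # current policy
masks = [sample_mask(old_a, old_b, rng) for _ in range(4)]
rewards = [0.1, 0.4, 0.3, 0.8]  # placeholder multimodal rewards
advantages = group_relative_advantages(rewards)
logp_old = [mask_log_prob(m, old_a, old_b) for m in masks]
logp_new = [mask_log_prob(m, new_a, new_b) for m in masks]
objective = clipped_surrogate(logp_new, logp_old, advantages)
```

In a real training loop the Beta parameters would be predicted per bin by the separator network and the objective would be differentiated with an autodiff framework; the pure-Python version above only illustrates how the quantities fit together.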

Paper Structure

This paper contains 37 sections, 15 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The reinforcement learning loop of MARS-Sep. The separator generates stochastic mask actions from a Beta-distributed policy, while a frozen snapshot serves as the old policy for stable optimization. Multimodal rewards derived from audio, text, and visual embeddings guide policy updates, with entropy and KL regularization enhancing exploration and stability.
  • Figure 2: Progressive fine-tuning strategy for sound source discrimination and separation. Encoders remain frozen while task-specific heads are gradually unfrozen, and each stage builds on the best checkpoint from the previous one. The latter two stages are also trained on a fraction of the previously aligned paired data to avoid catastrophic forgetting.
  • Figure 3: Log-mel spectrograms of separated audio from different query modalities on VGGSOUND-clean+ dataset. The target source is "cattle bovinae cowbell". From left to right: (a) Mixture of "cattle bovinae cowbell" and "tap dancing"; (b) Ground-truth "cattle bovinae cowbell"; (c) Interference "tap dancing"; (d) Separation with text query by the baseline model; (e) Separation with text query by our model.
  • Figure 4: Qualitative comparison of separation results in the TQSS setting. Each group contains 5 spectrograms: mixed input, target source, interference source, separation by the baseline (OmniSep), and separation by our method.
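
The reward path in Figure 1 can be illustrated with a small sketch. The source states only that rewards are derived from audio, text, and visual embeddings. The cosine-similarity form, the weighted average over available query modalities, and the names `cosine` and `multimodal_reward` are assumptions made here for illustration, and the toy two-dimensional vectors stand in for embeddings from a shared-space encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-12)

def multimodal_reward(audio_emb, query_embs, weights=None):
    """Weighted mean similarity between the separated-audio embedding and
    each available query embedding (keys such as 'text' or 'image')."""
    names = list(query_embs)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumed uniform weighting
    total_weight = sum(weights[name] for name in names)
    score = sum(weights[name] * cosine(audio_emb, query_embs[name])
                for name in names)
    return score / total_weight

# Usage with toy embeddings: the separated audio matches the text query
# exactly and is orthogonal to the image query.
reward = multimodal_reward([1.0, 0.0], {"text": [1.0, 0.0], "image": [0.0, 1.0]})
```

A reward of this shape rises when the separated audio moves closer to the query in the shared embedding space, which is why the faithfulness of the encoder's alignment matters for the policy update.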