MSRBench: A Benchmarking Dataset for Music Source Restoration

Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D. Plumbley

TL;DR

MSRBench tackles the challenge of restoring unprocessed music signals degraded by real-world production effects and codecs. The authors provide the first benchmark with true unprocessed ground-truth stems and professionally mixed outputs across eight instrument classes, augmented with 12 degradations, enabling evaluation with both SI-SNR and perceptual metrics like FAD$_{CLAP}$. Baseline experiments show a large gap between reconstruction fidelity and perceptual quality, highlighting phase estimation as a fundamental bottleneck and suggesting the need for phase-aware, instrument-specific restoration methods. By releasing the dataset and baselines, the work establishes a practical benchmark to drive progress in music source restoration and motivates new evaluation protocols that reflect perceptual usefulness over exact signal reconstruction.

Abstract

Music Source Restoration (MSR) extends source separation to realistic settings where signals undergo production effects (equalization, compression, reverb) and real-world degradations, with the goal of recovering the original unprocessed sources. Existing benchmarks cannot measure restoration fidelity: synthetic datasets use unprocessed stems but unrealistic mixtures, while real production datasets provide only already-processed stems without clean references. We present MSRBench, the first benchmark explicitly designed for MSR evaluation. MSRBench contains raw stem-mixture pairs across eight instrument classes, where mixtures are produced by professional mixing engineers. These raw-processed pairs enable direct evaluation of both separation accuracy and restoration fidelity. Beyond controlled studio conditions, the mixtures are augmented with twelve real-world degradations spanning analog artifacts, acoustic environments, and lossy codecs. Baseline experiments with U-Net and BSRNN achieve SI-SNR of -37.8 dB and -23.4 dB, respectively, with perceptual quality (FAD$_{CLAP}$) around 0.7-0.8, demonstrating substantial room for improvement and the need for restoration-specific architectures.
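
For readers interpreting the negative SI-SNR values above, the commonly used definition of scale-invariant SNR is sketched here. This is the standard formulation and is an assumption on our part: the paper's exact variant (for example, whether signals are mean-centered first) is not stated in this summary. Here $\hat{\mathbf{s}}$ denotes the estimated source and $\mathbf{s}$ the unprocessed reference.

$$
\mathbf{s}_{\text{target}} = \frac{\langle \hat{\mathbf{s}}, \mathbf{s} \rangle}{\lVert \mathbf{s} \rVert^2}\,\mathbf{s}, \qquad
\mathbf{e}_{\text{noise}} = \hat{\mathbf{s}} - \mathbf{s}_{\text{target}}, \qquad
\text{SI-SNR} = 10 \log_{10} \frac{\lVert \mathbf{s}_{\text{target}} \rVert^2}{\lVert \mathbf{e}_{\text{noise}} \rVert^2}
$$

Under this definition a negative value means the residual carries more energy than the component of the estimate aligned with the reference. At -23.4 dB the residual energy is roughly 200 times the aligned energy, and at -37.8 dB roughly 6000 times.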

Paper Structure

This paper contains 12 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of two baseline architectures. Both networks process complex mel spectrograms: the U-Net employs cascaded downsampling and upsampling layers with skip connections, while the BSRNN uses grouped RNNs operating in parallel to alternately model the temporal and frequency axes. The frequency-axis modeling is bidirectional, whereas temporal modeling is unidirectional (causal).
  • Figure 2: Mel spectrogram examples showing mixture, target, and predictions from both baseline models. Despite poor SI-SNR values, predictions exhibit mostly correct spectral structure and temporal evolution; the primary artifacts are reduced fine-grained detail and incomplete removal of other instruments rather than catastrophic distortion. Notably, the vocal estimate is almost completely correct, yet it still scores a very low SI-SNR.
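
The mismatch noted in Figure 2, where spectrograms look right but SI-SNR is very low, can be reproduced on a toy signal. The sketch below is not the authors' evaluation code: it uses the standard SI-SNR definition, and `phase_shifted_tone` is a hypothetical helper introduced only for illustration. It compares a pure tone against copies of itself that differ only by a constant phase offset, so the magnitude spectrum stays the same while the waveform moves.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB (standard definition; the paper's variant may differ)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def phase_shifted_tone(phase_deg, freq_hz=440.0, sample_rate=16000, seconds=1.0):
    """Hypothetical test signal: a pure tone with a constant phase offset."""
    t = np.arange(int(sample_rate * seconds)) / sample_rate
    return np.sin(2.0 * np.pi * freq_hz * t + np.deg2rad(phase_deg))

reference = phase_shifted_tone(0.0)
for phase_deg in (0.0, 10.0, 45.0, 80.0):
    estimate = phase_shifted_tone(phase_deg)
    # The phase offset leaves the magnitude spectrum unchanged; only SI-SNR moves.
    same_magnitude = np.allclose(
        np.abs(np.fft.rfft(estimate)), np.abs(np.fft.rfft(reference)), atol=1e-6
    )
    print(f"phase={phase_deg:5.1f} deg  same_magnitude={same_magnitude}  "
          f"SI-SNR={si_snr(estimate, reference):8.2f} dB")
```

For a pure tone over whole periods, the SI-SNR reduces to $20\log_{10}\lvert\cot\varphi\rvert$ for phase offset $\varphi$, so a 45 degree offset already gives 0 dB and larger offsets go negative even though the magnitude spectrum is identical. Real music is far more complex than a single tone, so this only illustrates why a waveform metric can penalize phase errors that a mel spectrogram does not show.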