MSRBench: A Benchmarking Dataset for Music Source Restoration
Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D. Plumbley
TL;DR
MSRBench addresses music source restoration: recovering the original, unprocessed sources from music that has been altered by production effects and real-world degradations such as lossy codecs. The authors provide the first benchmark that pairs unprocessed ground-truth stems with professionally mixed outputs across eight instrument classes, augmented with twelve degradations, enabling evaluation with both SI-SNR and perceptual metrics such as FAD$_{CLAP}$. Baseline experiments show a large gap between reconstruction fidelity and perceptual quality, highlighting phase estimation as a fundamental bottleneck and suggesting the need for phase-aware, instrument-specific restoration methods. By releasing the dataset and baselines, the work establishes a practical benchmark to drive progress in music source restoration and motivates evaluation protocols that prioritize perceptual usefulness over exact signal reconstruction.
Abstract
Music Source Restoration (MSR) extends source separation to realistic settings where signals undergo production effects (equalization, compression, reverb) and real-world degradations, with the goal of recovering the original unprocessed sources. Existing benchmarks cannot measure restoration fidelity: synthetic datasets use unprocessed stems but unrealistic mixtures, while real production datasets provide only already-processed stems without clean references. We present MSRBench, the first benchmark explicitly designed for MSR evaluation. MSRBench contains raw stem-mixture pairs across eight instrument classes, where mixtures are produced by professional mixing engineers. These raw-processed pairs enable direct evaluation of both separation accuracy and restoration fidelity. Beyond controlled studio conditions, the mixtures are augmented with twelve real-world degradations spanning analog artifacts, acoustic environments, and lossy codecs. Baseline experiments with U-Net and BSRNN achieve SI-SNR of -37.8 dB and -23.4 dB respectively, with perceptual quality (FAD$_{CLAP}$) around 0.7 to 0.8, demonstrating substantial room for improvement and the need for restoration-specific architectures.
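For readers unfamiliar with the SI-SNR figures quoted above, the following is a minimal sketch of the commonly used scale-invariant signal-to-noise ratio, written with NumPy. It is an illustration under stated assumptions, not the benchmark's own evaluation code: the function name `si_snr`, the `eps` stabilizer, and the zero-mean normalization step are choices made here, and the benchmark's exact implementation may differ.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (hypothetical helper)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove DC offset so the score ignores constant bias (assumed convention).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to find the optimal scaling.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference

    # Whatever is not explained by the scaled reference counts as noise.
    noise = estimate - target

    target_energy = np.dot(target, target) + eps
    noise_energy = np.dot(noise, noise) + eps
    return 10.0 * np.log10(target_energy / noise_energy)


if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    reference = np.sin(2 * np.pi * 440 * t)

    # A noisy copy of the reference scores well above 0 dB.
    rng = np.random.default_rng(0)
    noisy = reference + 0.1 * rng.standard_normal(t.shape)
    print("noisy copy:", si_snr(noisy, reference))

    # The same tone shifted by a quarter period scores strongly negative.
    shifted = np.cos(2 * np.pi * 440 * t)
    print("quarter-period shift:", si_snr(shifted, reference))
```

Because the metric compares waveforms sample by sample, an estimate with the right magnitude spectrum but the wrong phase can still score far below 0 dB, as the quarter-period example shows. This is one way a waveform-level metric can diverge from a perceptual metric such as FAD$_{CLAP}$.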
