The Spheres Dataset: Multitrack Orchestral Recordings for Music Source Separation and Information Retrieval
Jaime Garcia-Martinez, David Diaz-Guerra, John Anderson, Ricardo Falcon-Perez, Pablo Cabañas-Molero, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas
TL;DR
The Spheres dataset addresses the scarcity of publicly available multitrack orchestral data for music source separation in classical music by providing isolated stems for each instrument across main, ambient, and close mics, recorded in a controlled studio with measured room acoustics. The authors propose a careful recording and data-structuring workflow, including RIR estimates from instrument positions and a mixture framework that preserves realistic bleed, enabling both separation and dereverberation research. Baseline experiments using X-UMX demonstrate potential gains in instrument-family separation and close-mic debleeding while highlighting generalization challenges across recording setups. By releasing all materials, scripts, and RIRs, The Spheres establishes a valuable benchmark for evaluating MSS, localization, and immersive rendering in realistic classical-music scenarios, bridging synthetic data and real-world orchestral applications.
Abstract
This paper introduces The Spheres dataset, a collection of multitrack orchestral recordings designed to advance machine learning research in music source separation and related MIR tasks within the classical music domain. The dataset comprises over one hour of recordings performed by the Colibrì Ensemble at The Spheres recording studio, capturing two canonical works, Tchaikovsky's Romeo and Juliet and Mozart's Symphony No. 40, along with chromatic scales and solo excerpts for each instrument. The recording setup employed 23 microphones, including close spot, main, and ambient microphones, enabling the creation of realistic stereo mixes with controlled bleed and providing isolated stems for supervised training of source separation models. In addition, room impulse responses were estimated for each instrument position, offering an acoustic characterization of the recording space. We present the dataset structure, an acoustic analysis, and baseline evaluations using X-UMX-based models for orchestral family separation and microphone debleeding. The results highlight both the potential and the challenges of source separation in complex orchestral scenarios, underscoring the dataset's value for benchmarking and for exploring new approaches to separation, localization, dereverberation, and immersive rendering of classical music.
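To make the idea of building a stereo mix from individual microphone tracks concrete, the following is a minimal sketch, not the authors' mixing scripts. It assumes each microphone track is a mono list of samples and that each track is given a linear gain and a pan position. The function names `pan_gains` and `stereo_mix`, the constant-power pan law, and the example values are all illustrative assumptions that do not come from the paper.

```python
import math


def pan_gains(pan):
    """Constant-power stereo gains for pan in [-1.0 (left), +1.0 (right)]."""
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to [0, pi/2]
    return math.cos(angle), math.sin(angle)


def stereo_mix(tracks, gains, pans):
    """Sum mono microphone tracks into a stereo mix (hypothetical helper).

    tracks: list of equal-length lists of float samples, one per microphone
    gains:  linear gain applied to each track
    pans:   pan position of each track, -1.0 (left) to +1.0 (right)
    """
    n = len(tracks[0])
    left = [0.0] * n
    right = [0.0] * n
    for track, gain, pan in zip(tracks, gains, pans):
        g_left, g_right = pan_gains(pan)
        for i, sample in enumerate(track):
            left[i] += gain * g_left * sample
            right[i] += gain * g_right * sample
    return left, right


# Toy example: two "microphones", one panned hard left, one hard right.
left, right = stereo_mix(
    tracks=[[1.0, 0.0], [0.0, 1.0]],
    gains=[1.0, 0.5],
    pans=[-1.0, 1.0],
)
```

Because each close microphone also picks up neighboring instruments, a mix summed this way from real recordings keeps the natural bleed, whereas the isolated stems serve as the supervised targets.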
