Toward Deep Drum Source Separation

Alessandro Ilic Mezza, Riccardo Giampiccolo, Alberto Bernardini, Augusto Sarti

TL;DR

This work addresses data scarcity in deep drum source separation (DSS) with two contributions: StemGMD, a large MIDI-driven dataset of isolated drum stems covering every instrument of a canonical nine-piece kit, and LarsNet, a deep model built from five parallel U-Nets that separates five stems from a stereo drum mixture. Each U-Net estimates a spectro-temporal mask for its stem; the masks can optionally be refined with alpha-Wiener filtering, and training relies on extensive data augmentation. LarsNet is evaluated against the SAB-NMF and NMFD baselines on a dedicated Eval Session. It achieves substantial SDR gains and reduced cross-talk, produces near-zero output for stems that have zero energy in the mixture, and runs in real time on standard hardware, establishing a baseline for future deep DSS research. The dataset and model are promising for remixing, transcription, and educational tools, where finer-grained control over individual drum elements is useful.
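The masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, array shapes, the mono mixture, and the random stand-ins for the U-Net outputs are all assumptions made for illustration, and the alpha-Wiener step uses the common generalized Wiener form (per-stem magnitude raised to `alpha`, normalized across stems), which may differ in detail from the paper's formulation.

```python
import numpy as np


def alpha_wiener_masks(stem_mags, alpha=1.0, eps=1e-10):
    """Turn per-stem magnitude estimates into masks that sum to ~1 per bin.

    stem_mags: nonnegative array of shape (n_stems, n_freq, n_frames).
    alpha: exponent applied to the magnitudes (alpha=2 weights by power).
    eps: small constant that avoids division by zero in silent bins.
    """
    powered = np.power(stem_mags, alpha)
    return powered / (powered.sum(axis=0, keepdims=True) + eps)


def separate(mixture_stft, soft_masks, alpha=None):
    """Apply per-stem masks to a complex mixture STFT.

    mixture_stft: complex array of shape (n_freq, n_frames).
    soft_masks: array in [0, 1] of shape (n_stems, n_freq, n_frames),
        standing in for the outputs of the parallel U-Nets (assumption).
    alpha: if None, use the raw masks; otherwise refine them with
        alpha-Wiener filtering. Both paths reuse the mixture phase.
    """
    if alpha is None:
        return soft_masks * mixture_stft
    stem_mags = soft_masks * np.abs(mixture_stft)
    return alpha_wiener_masks(stem_mags, alpha) * mixture_stft


# Toy example: 5 stems, 8 frequency bins, 4 frames (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)) + 1j * rng.normal(size=(8, 4))
soft_masks = rng.uniform(0.1, 1.0, size=(5, 8, 4))
soft_masks[0] = 0.0  # pretend the first stem is silent

stems = separate(X, soft_masks, alpha=2.0)
```

With the Wiener refinement the stem estimates add back up to the mixture, which raw independent masks do not guarantee, and a stem whose mask is zero everywhere stays silent. A stereo input would need the same operation applied per channel.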

Abstract

In the past, the field of drum source separation faced significant challenges due to limited data availability, hindering the adoption of cutting-edge deep learning methods that have found success in other related audio applications. In this manuscript, we introduce StemGMD, a large-scale audio dataset of isolated single-instrument drum stems. Each audio clip is synthesized from MIDI recordings of expressive drum performances using ten real-sounding acoustic drum kits. Totaling 1224 hours, StemGMD is the largest audio dataset of drums to date and the first to comprise isolated audio clips for every instrument in a canonical nine-piece drum kit. We leverage StemGMD to develop LarsNet, a novel deep drum source separation model. Through a bank of dedicated U-Nets, LarsNet can separate five stems from a stereo drum mixture faster than real-time and is shown to significantly outperform state-of-the-art nonnegative spectro-temporal factorization methods.

Paper Structure

This paper contains 13 sections, 7 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: LarsNet architecture.
  • Figure 2: U-Net architecture.