DOSE : Drum One-Shot Extraction from Music Mixture

Suntae Hwang; Seonghyeon Kang; Kyungsu Kim; Semin Ahn; Kyogu Lee

TL;DR

This work addresses extracting drum one-shot samples directly from music mixtures by introducing DOSE, a generation-based model that uses a neural audio codec language model with per-drum-type decoder-only transformers and an onset-focused loss. It pairs this with RMOD, a large synthetic dataset of randomly mixed four-second mixtures and corresponding drum one-shots, augmented with layering and randomized production effects to approximate real-world variability. Empirical results show DOSE outperforms a separation-based baseline on objective perceptual metrics (FAD, MSS), with onset loss improving transient fidelity, though some domain gaps remain when evaluating realistic Groove MIDI data. The approach enables end-to-end drum sample extraction without explicit source separation and has potential for extending to additional instruments, offering practical benefits for sound design and music production.

Abstract

Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction, a task in which the goal is to extract drum one-shots that are present in the music mixture. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with corresponding drum one-shot samples. Our proposed model, Drum One-Shot Extractor (DOSE), leverages neural audio codec language models for end-to-end extraction, bypassing traditional source separation steps. Additionally, we introduce a novel onset loss, designed to encourage accurate prediction of the initial transient of drum one-shots, which is essential for capturing timbral characteristics. We compare this approach against a source separation-based extraction method as a baseline. The results, evaluated using Fréchet Audio Distance (FAD) and Multi-Scale Spectral loss (MSS), demonstrate that DOSE, enhanced with onset loss, outperforms the baseline, providing more accurate and higher-quality drum one-shots from music mixtures. The code, model checkpoint, and audio examples are available at https://github.com/HSUNEH/DOSE
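To make the MSS metric named in the abstract concrete, the following is a minimal sketch of one common formulation of a multi-scale spectral distance: mean L1 differences between linear and log magnitude spectrograms, summed over several FFT sizes. The FFT sizes, the 75% overlap, the Hann window, and the function names are assumptions chosen for illustration; the abstract does not specify the configuration used in the paper, and this is not taken from the authors' code.

```python
import numpy as np


def stft_magnitude(signal, fft_size, hop):
    """Magnitude STFT with a Hann window; frames are taken without padding."""
    window = np.hanning(fft_size)
    n_frames = 1 + (len(signal) - fft_size) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + fft_size] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_scale_spectral_distance(
    reference, estimate, fft_sizes=(2048, 1024, 512, 256), eps=1e-7
):
    """Sum over FFT sizes of mean L1 distances on linear and log magnitudes.

    Both inputs must be 1-D arrays of equal length, at least max(fft_sizes) long.
    """
    total = 0.0
    for fft_size in fft_sizes:
        hop = fft_size // 4  # 75% overlap: an assumed setting, not from the paper
        ref_mag = stft_magnitude(reference, fft_size, hop)
        est_mag = stft_magnitude(estimate, fft_size, hop)
        linear_term = np.mean(np.abs(ref_mag - est_mag))
        log_term = np.mean(np.abs(np.log(ref_mag + eps) - np.log(est_mag + eps)))
        total += linear_term + log_term
    return float(total)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(8000)            # stand-in for a reference one-shot
    noisy = target + 0.1 * rng.standard_normal(8000)  # stand-in for an extracted one-shot
    print(multi_scale_spectral_distance(target, noisy))
```

Lower values indicate that the extracted one-shot is spectrally closer to the reference. The short windows resolve transients and the long windows resolve pitch and decay, which is why several resolutions are combined.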
