DOSE : Drum One-Shot Extraction from Music Mixture

Suntae Hwang, Seonghyeon Kang, Kyungsu Kim, Semin Ahn, Kyogu Lee

TL;DR

This work addresses extracting drum one-shot samples directly from music mixtures by introducing DOSE, a generation-based model that uses a neural audio codec language model with per-drum-type decoder-only transformers and an onset-focused loss. It pairs this with RMOD, a large synthetic dataset of randomly mixed four-second mixtures and corresponding drum one-shots, augmented with layering and randomized production effects to approximate real-world variability. Empirical results show DOSE outperforms a separation-based baseline on objective perceptual metrics (FAD, MSS), with onset loss improving transient fidelity, though some domain gaps remain when evaluating realistic Groove MIDI data. The approach enables end-to-end drum sample extraction without explicit source separation and has potential for extending to additional instruments, offering practical benefits for sound design and music production.

Abstract

Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction, a task in which the goal is to extract drum one-shots that are present in the music mixture. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with corresponding drum one-shot samples. Our proposed model, Drum One-Shot Extractor (DOSE), leverages neural audio codec language models for end-to-end extraction, bypassing traditional source separation steps. Additionally, we introduce a novel onset loss, designed to encourage accurate prediction of the initial transient of drum one-shots, which is essential for capturing timbral characteristics. We compare this approach against a source separation-based extraction method as a baseline. The results, evaluated using Fréchet Audio Distance (FAD) and Multi-Scale Spectral loss (MSS), demonstrate that DOSE, enhanced with onset loss, outperforms the baseline, providing more accurate and higher-quality drum one-shots from music mixtures. The code, model checkpoint, and audio examples are available at https://github.com/HSUNEH/DOSE

Paper Structure

This paper contains 14 sections, 3 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Illustration of our approach. Given an audio mixture as input, each Drum One-Shot Extractor (DOSE) model extracts one-shot audio samples for one drum type: kick, snare, or hi-hat.
  • Figure 2: Proposed Method. The input audio mixture is encoded into a sequence of discrete tokens using a frozen DAC encoder, and these tokens are then fed into a decoder-only transformer. The transformer is trained to autoregressively predict the ground-truth drum one-shot tokens by minimizing two losses: onset loss and full-length loss. Finally, the predicted token sequence is decoded into drum one-shot audio using the DAC decoder.
  • Figure 3: Dataset generation process. First, kick, snare, and hi-hat loops are synthesized from one-shot drum audio samples using randomly generated MIDI notes. Next, optional bass, piano, guitar, and vocal loops are selected. The drum loops and other musical loops are then processed through independent mixing chains, which apply gain, EQ, compression, panning, limiting, delay, and reverb effects. Finally, all tracks are combined and passed through a mastering chain consisting of EQ and limiter effects.
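
The two training losses named in the Figure 2 caption can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation. It assumes the onset loss is a cross-entropy restricted to the first few predicted token frames, added to the cross-entropy over the full token sequence. The names `combined_loss`, `onset_frames`, and `onset_weight`, along with their default values, are hypothetical and do not come from the paper.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_ce(logit_seq, target_seq):
    """Average cross-entropy over a sequence of token positions."""
    losses = [cross_entropy(l, t) for l, t in zip(logit_seq, target_seq)]
    return sum(losses) / len(losses)


def combined_loss(logit_seq, target_seq, onset_frames=2, onset_weight=1.0):
    """Full-length loss plus an extra penalty on the first `onset_frames` tokens.

    ASSUMPTION: the onset window length and the weighting are illustrative
    choices, not values taken from the paper.
    """
    full = mean_ce(logit_seq, target_seq)
    onset = mean_ce(logit_seq[:onset_frames], target_seq[:onset_frames])
    return full + onset_weight * onset, full, onset


# Toy example: 5 token positions over a vocabulary of 4 codes.
# The model is confident and correct everywhere except the first position.
targets = [0, 1, 2, 3, 0]
logits = [[0.0, 0.0, 0.0, 0.0]] + [
    [5.0 if k == t else 0.0 for k in range(4)] for t in targets[1:]
]
total, full, onset = combined_loss(logits, targets)
```

In this toy case the only uncertain prediction sits in the onset window, so the onset term is larger than the full-length average and errors on the initial transient are penalized more heavily than errors later in the sequence.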