Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima

TL;DR

Noise-to-Notes (N2N) is a framework that uses diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. Features extracted from music foundation models (MFMs) are added to improve robustness to out-of-domain drum audio.

Abstract

Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, jointly generating binary onset values and continuous velocity values is challenging for diffusion models; to overcome this, we introduce an Annealed Pseudo-Huber loss that facilitates effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and that N2N establishes new state-of-the-art performance across multiple ADT benchmarks.
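The abstract names an Annealed Pseudo-Huber loss but does not give its form here. As a rough illustration only, the sketch below uses the standard pseudo-Huber penalty, sqrt(r^2 + c^2) - c, which behaves like a squared error for residuals much smaller than c and like an absolute error for residuals much larger than c. The function names, the linear schedule, its direction, and its endpoints (`c_start`, `c_end`) are all assumptions made for this example, not the authors' definition.

```python
import math


def pseudo_huber(residual: float, c: float) -> float:
    """Standard pseudo-Huber penalty: sqrt(r^2 + c^2) - c."""
    return math.sqrt(residual * residual + c * c) - c


def annealed_c(step: int, total_steps: int,
               c_start: float = 1.0, c_end: float = 0.01) -> float:
    """Hypothetical linear schedule for c (an assumption, not from the paper)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return c_start + frac * (c_end - c_start)


def annealed_pseudo_huber_loss(pred, target, step: int, total_steps: int) -> float:
    """Mean pseudo-Huber penalty over paired values, with c set by the schedule."""
    c = annealed_c(step, total_steps)
    return sum(pseudo_huber(p - t, c) for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    pred = [0.9, 0.1, 0.4]    # toy onset/velocity predictions
    target = [1.0, 0.0, 0.5]  # toy ground truth
    for step in (0, 50, 100):
        print(step, annealed_pseudo_huber_loss(pred, target, step, 100))
```

Under this assumed schedule, a single parameter c moves the penalty between a smoother L2-like regime and a sharper L1-like regime, which is one plausible way to balance binary onsets against continuous velocities.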

Paper Structure

This paper contains 9 sections, 3 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Overview. Noise-to-Notes is a diffusion model that transcribes drum audio. By reframing ADT as a generative task, N2N is capable of transcription with complete audio (conditional), partial audio (inpainting), and absence of audio (unconditional).
  • Figure 2: Noise-to-Notes architecture. N2N is an audio-conditioned, transformer-based diffusion model. Drum audio features are extracted with a log mel-spectrogram and a music foundation model. These features, combined with timestep ($\sigma_t$) information, modulate the decoder through cross-attention and FiLM layers.
  • Figure 3: Onset F1 scores per drum component. The E-GMD dataset is used for training, while IDMT and MDB are external datasets. Each dataset labels a different number of drum components, so predictions are remapped to 7, 5, and 3 components for the E-GMD, MDB, and IDMT datasets, respectively.
  • Figure 4: Speed-accuracy trade-off. E-GMD test set performance and associated inference time using N2N with different numbers of sampling steps. Inference time is for a single five-second transcript on one NVIDIA A100 GPU.
  • Figure 5: t-SNE plots of spectrogram (left) and MFM (right) features. Features were extracted from the N2N input projection layer.
  • ...and 1 more figures
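
The Figure 2 caption mentions FiLM layers modulating the decoder. As a minimal sketch of the general FiLM idea (feature-wise linear modulation), a conditioning vector is projected to a per-channel scale `gamma` and shift `beta`, which are then applied to every frame of the hidden features. The helper names, shapes, and weights below are hypothetical and chosen for illustration; how N2N wires its conditioning is not specified in this summary.

```python
def linear(vec, weight, bias):
    """Plain affine map: out[i] = sum_j weight[i][j] * vec[j] + bias[i]."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def film(features, gamma, beta):
    """Feature-wise linear modulation: y[t][d] = gamma[d] * x[t][d] + beta[d]."""
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in features]


if __name__ == "__main__":
    # Toy conditioning vector (e.g. a timestep embedding) with 2 dimensions.
    cond = [1.0, 0.0]
    # Hypothetical projection weights producing a 2-channel scale and shift.
    gamma = linear(cond, [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
    beta = linear(cond, [[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0])
    hidden = [[1.0, 2.0], [3.0, 4.0]]  # two frames, two channels
    print(film(hidden, gamma, beta))
```

Because the same `gamma` and `beta` are applied to every frame, the conditioning acts globally on each channel, which complements cross-attention's frame-specific conditioning.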