Text Conditioned Symbolic Drumbeat Generation using Latent Diffusion Models

Pushkar Jajoria; James McDermott

TL;DR

This work addresses the challenge of text-conditioned drumbeat generation by leveraging Latent Diffusion Models trained to respond to textual prompts derived from training filenames. It combines CLIP-like contrastive pretraining to align text and MIDI with a MultiResolutionLSTM-enhanced autoencoder, and performs diffusion in a latent space to improve speed and stability. Empirical distances and a listening test show that generated drumbeats are novel and closely aligned with prompts, with quality comparable to human-created outputs. The results demonstrate the feasibility and practical potential of text-to-drumbeat synthesis in latent space, enabling real-time, controllable symbolic drum generation, with code and samples released for reuse.

Abstract

This study introduces a text-conditioned approach to generating drumbeats with Latent Diffusion Models (LDMs). It uses informative conditioning text extracted from training data filenames. By pretraining a text encoder and a drumbeat encoder through contrastive learning within a multimodal network, following CLIP, we closely align the modalities of text and music. Additionally, we examine an alternative text encoder based on multihot text encodings. Inspired by music's multi-resolution nature, we propose a novel LSTM variant, MultiResolutionLSTM, designed to operate at various resolutions independently. In common with recent LDMs in the image space, our approach speeds up the generation process by running diffusion in a latent space provided by a pretrained unconditional autoencoder. We demonstrate the originality and variety of the generated drumbeats by measuring distances (both over binary pianorolls and in the latent space) between the generated drumbeats and the training dataset, and among the generated drumbeats themselves. We also assess the generated drumbeats through a listening test focused on questions of quality, aptness for the prompt text, and novelty. We show that the generated drumbeats are novel and apt to the prompt text, and comparable in quality to those created by human musicians.
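The distance-based novelty and variety checks described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the nested-list pianoroll layout (instruments by time steps), and the toy data are all assumptions made for illustration, and the paper may aggregate distances differently.

```python
import math
from itertools import combinations


def hamming_distance(roll_a, roll_b):
    """Count the cells in which two equally shaped binary pianorolls differ."""
    same_shape = len(roll_a) == len(roll_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(roll_a, roll_b)
    )
    if not same_shape:
        raise ValueError("pianorolls must have the same shape")
    return sum(
        a != b
        for row_a, row_b in zip(roll_a, roll_b)
        for a, b in zip(row_a, row_b)
    )


def euclidean_distance(z_a, z_b):
    """Euclidean distance between two latent vectors of equal length."""
    return math.dist(z_a, z_b)


def min_distance_to_set(item, reference, distance):
    """Novelty proxy: distance from one item to its nearest reference item."""
    return min(distance(item, ref) for ref in reference)


def mean_pairwise_distance(items, distance):
    """Variety proxy: average distance over all unordered pairs of items."""
    pairs = list(combinations(items, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: 2 instruments x 4 time steps per drumbeat.
train = [
    [[1, 0, 1, 0], [0, 1, 0, 1]],
    [[1, 1, 1, 1], [0, 0, 0, 0]],
]
generated = [
    [[1, 0, 1, 0], [0, 1, 1, 1]],
    [[1, 0, 0, 0], [0, 0, 0, 1]],
]

novelty = [min_distance_to_set(g, train, hamming_distance) for g in generated]
variety = mean_pairwise_distance(generated, hamming_distance)
```

The same two helpers work in the latent space by passing `euclidean_distance` and lists of latent vectors instead of pianorolls. A nearest-neighbour distance of zero would indicate a generated drumbeat that copies a training example exactly.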

Paper Structure (18 sections, 1 equation, 7 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Dataset
MIDI Preprocessing
Text Processing
Method
Text Encoding
Contrastive Language-MIDI Pretraining
Multihot text embedding
Autoencoder
Diffusion in Latent Space
Model & Training Details
Experiments and Results
Empirical Experiments
Listening Test
...and 3 more sections

Figures (7)

Figure 1: Text conditioned MIDI file generation flow incorporating all elements of the model. The overall flow involves converting text prompts to text embeddings. These text embeddings along with noise ($Z_0$) are passed into a Latent Diffusion Model, and decoded to produce the final drumbeat. The color scheme used in this diagram -- Text Encoder in green and MIDI Decoder in blue -- is consistent throughout the paper.
Figure 2: Text-supervised pretraining trains a text encoder jointly with a MIDI encoder so that text and drumbeat pianorolls are mapped into a shared latent space, similar to CLIP (Radford et al., 2021). The MIDI encoder is discarded after training and only the text encoder is used. The text encoder consists of a projection head over the BERT L-4 512 model, which maps the 512-dimensional BERT embeddings into the final text embeddings.
Figure 3: We train an Autoencoder (AE) for pianoroll drumbeats using reconstruction loss. The trained MIDI encoder generates the latent embedding of the drumbeat pianoroll corresponding to each MIDI file; these embeddings are needed to train an LDM. The MIDI decoder is used after the denoising process to generate a pianoroll drumbeat from a denoised $Z$. The MIDI encoder feature extractor consists of a 3-stacked LSTM (Hochreiter and Schmidhuber, 1997) which looks at the MIDI file at different resolutions.
Figure 4: Density plots comparing Hamming (top) and Euclidean (bottom) distances of drumbeats generated from identical versus different text prompts.
Figure 5: Two MIDI files generated from different text prompts. The file generated with "fill" in its prompt has fills added to it. Both files can be heard at https://soundcloud.com/user-32049071/rock-slow-4-4 and https://soundcloud.com/user-32049071/rock-slow-4-4-with-fills
...and 2 more figures
