SYMPLEX: Controllable Symbolic Music Generation using Simplex Diffusion with Vocabulary Priors

Nicolas Jonason, Luca Casini, Bob L. T. Sturm

TL;DR

SYMPLEX addresses fast, controllable symbolic music generation by introducing simplex diffusion operating on probability distributions $p_t$ over an unordered note-event vocabulary. The framework trains a denoising network to recover clean distributions from noisy inputs, and performs iterative inference to generate 4-bar multi-instrument MIDI loops, with controllability achieved via vocabulary priors $p_v$ applied during decoding. Key contributions include the first application of simplex diffusion to symbolic music, a method to steer generation through input priors without task-specific fine-tuning, and an extended loop extraction pipeline that uses metrical-structure information to build a large MIDI-loop dataset. The approach combines a transformer-based encoder with an orderless representation to enable tasks like infill, variations, and instrument/pitch constraints, offering plug-and-play guidance and efficient inference for programmable music generation.

Abstract

We present a new approach for fast and controllable generation of symbolic music based on simplex diffusion, which is essentially a diffusion process operating on probabilities rather than the signal space. This objective has been applied in domains such as natural language processing, but here we apply it to generating 4-bar multi-instrument music loops using an orderless representation. We show that our model can be steered with vocabulary priors, which affords a considerable level of control over the music generation process, for instance, infilling in time and pitch and choice of instrumentation -- all without task-specific model adaptation or applying extrinsic control.
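To make the idea of steering with a vocabulary prior concrete, here is a minimal sketch in plain Python. It assumes the prior acts by element-wise multiplication with a predicted distribution followed by renormalisation; the abstract does not state the exact mechanism, so this is an illustration and not the authors' implementation. The function name `apply_prior` and the 4-token vocabulary are hypothetical.

```python
def apply_prior(probs, prior):
    """Weight a distribution by a prior element-wise, then renormalise.

    Assumption: this multiplicative form is illustrative only; the paper's
    actual way of injecting priors may differ.
    """
    weighted = [p * q for p, q in zip(probs, prior)]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("prior excludes every token the model gives mass to")
    return [w / total for w in weighted]


# Hypothetical 4-token vocabulary; the prior forbids tokens 1 and 3.
model_probs = [0.1, 0.4, 0.2, 0.3]
prior = [1.0, 0.0, 1.0, 0.0]
steered = apply_prior(model_probs, prior)
```

The result stays on the probability simplex (non-negative entries summing to one), and tokens with zero prior probability can never be selected, which is the property that constraints such as infilling or instrument choice rely on.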

Paper Structure (13 sections, 6 equations, 2 figures)

Figures (2)

  • Figure 1: Expressing various symbolic music generation tasks as priors on unordered representations. For clarity, this figure uses a toy representation of music with 4 pitches, 4 discrete onset times and 4 discrete offset times. The upper row shows vocabulary priors, where non-zero probabilities are represented with white cells. The bottom row illustrates the constraints in piano roll form. Each note event, colour coded in the figure, has three attribute columns representing the pitch, onset and offset constraints respectively. A colour gradient in the piano roll indicates that the note event(s) of the corresponding hue might be generated in the region. 1. shows a fully determined musical piece containing 3 notes. Notice how the orange note has all attributes set to undefined, indicating an inactive note. 2. shows a completely uninformative prior. Notice how the pitch, onset and offset vocabularies don't overlap. 3. shows a prior representing a time-pitch infilling task with the piece depicted in 1. as the input. Notice how the red and orange notes differ in their columns: the red note is guaranteed to be active, whereas the orange note may or may not be active. This allows us to express precise ranges on the number of notes we want. In this case, we are saying: "infill this region with at least one note". 4. shows how we can use priors to control tonality and rhythm.
  • Figure 2: Examples of the tasks we experimented with. More examples with audio are on the website.
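The toy setting in the Figure 1 caption can be sketched as binary masks over a small vocabulary. The code below is an illustration under stated assumptions, not the paper's representation: the token names, the per-attribute undefined token written as `-`, and the helper `note_prior` are all hypothetical. Each note event gets three mask columns (pitch, onset, offset), and allowing the undefined token is what makes a note optional.

```python
# Toy vocabulary: 4 pitches, 4 onsets, 4 offsets, plus one "undefined"
# token per attribute (an assumption; the caption only says "undefined").
PITCHES = [f"pitch:{i}" for i in range(4)] + ["pitch:-"]
ONSETS = [f"onset:{i}" for i in range(4)] + ["onset:-"]
OFFSETS = [f"offset:{i}" for i in range(4)] + ["offset:-"]
VOCAB = PITCHES + ONSETS + OFFSETS
INDEX = {tok: i for i, tok in enumerate(VOCAB)}


def mask(allowed):
    """Binary mask over VOCAB with 1 for each allowed token."""
    row = [0] * len(VOCAB)
    for tok in allowed:
        row[INDEX[tok]] = 1
    return row


def note_prior(pitches, onsets, offsets, optional=False):
    """Three mask columns (pitch, onset, offset) for one note event.

    optional=True also allows the undefined token, so the note may be inactive.
    """
    cols = []
    for name, values in (("pitch", pitches), ("onset", onsets), ("offset", offsets)):
        allowed = [f"{name}:{v}" for v in values]
        if optional:
            allowed.append(f"{name}:-")
        cols.append(mask(allowed))
    return cols


# Panel 1: a fully determined note and an inactive note.
fixed = note_prior([2], [0], [1])
inactive = note_prior([], [], [], optional=True)

# Panel 2: a completely uninformative prior for one note event.
free = note_prior(range(4), range(4), range(4), optional=True)

# Panel 3: infill a region with at least one note: one note must be
# active, a second one may or may not be.
region = dict(pitches=[1, 2], onsets=[2, 3], offsets=[2, 3])
required_infill = note_prior(**region, optional=False)
optional_infill = note_prior(**region, optional=True)
```

With one required and one optional note, the region receives between one and two notes, which mirrors the "at least one note" reading of panel 3. Adding more optional notes would raise the upper bound.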