Steer-by-prior Editing of Symbolic Music Loops

Nicolas Jonason, Luca Casini, Bob L. T. Sturm

TL;DR

The paper tackles controllable symbolic music loop editing by introducing Superposed Language Modelling (SLM), a generalization of Masked Language Modelling that uses priors $\boldsymbol{\pi}\in[0,1]^{T\times|V|}$ over sequences, with $\sum_v \pi_{t,v}=1$ and $\pi_{t,v}>0$ when $x_t=v$, to enable inference-time constraints. It trains a bi-directional Transformer on a permutation-invariant 4-bar MIDI loop representation with up to $N=300$ notes and $A=9$ attributes, employing a Random-add superposition scheme to generate priors and steer generation. Through inference-time sampling from priors, SLM can steer attributes such as pitch, onset, and rhythm while preserving other content, demonstrated on a diverse set of editing tasks. Limitations include representation-dependent control granularity, the need for musical/programming knowledge to craft priors, and relatively slow sampling (around $7$ seconds for a 4-bar loop); future work targets rigorous evaluation, speedups, longer contexts, and natural-language interfaces for usability.
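The two conditions on the prior can be made concrete with a small NumPy sketch. It is illustrative only and is not taken from the paper's code: the function names `normalize_prior`, `is_valid_prior` and `mlm_prior` are hypothetical, and representing a masked position as a uniform prior is an assumption about how Masked Language Modelling would be expressed in this notation.

```python
import numpy as np

def normalize_prior(weights):
    """Turn non-negative weights of shape (T, |V|) into a row-stochastic prior."""
    weights = np.asarray(weights, dtype=float)
    totals = weights.sum(axis=1, keepdims=True)
    if np.any(totals <= 0):
        raise ValueError("every position needs at least one allowed token")
    return weights / totals

def is_valid_prior(prior, x):
    """Check sum_v pi[t, v] == 1 and pi[t, x[t]] > 0 for every position t."""
    prior = np.asarray(prior, dtype=float)
    x = np.asarray(x)
    rows_sum_to_one = np.allclose(prior.sum(axis=1), 1.0)
    target_supported = np.all(prior[np.arange(len(x)), x] > 0)
    return bool(rows_sum_to_one and target_supported)

def mlm_prior(x, known, vocab_size):
    """Assumed MLM special case: one-hot where known, uniform where masked."""
    x = np.asarray(x)
    known = np.asarray(known, dtype=bool)
    weights = np.ones((len(x), vocab_size))
    weights[known] = 0.0                 # clear the rows of known positions
    weights[known, x[known]] = 1.0       # keep only the observed token
    return normalize_prior(weights)

# Toy sequence of T=4 tokens over a vocabulary of |V|=5 symbols.
x = np.array([2, 0, 4, 1])
known = np.array([True, False, True, False])
prior = mlm_prior(x, known, vocab_size=5)
valid = is_valid_prior(prior, x)
```

Any prior between these two extremes (for example, a row that allows only three of the five tokens) passes the same check as long as it keeps non-zero mass on the true token.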

Abstract

With the goal of building a system capable of controllable symbolic music loop generation and editing, this paper explores a generalisation of Masked Language Modelling we call Superposed Language Modelling. Rather than input tokens being known or unknown, a Superposed Language Model takes priors over the sequence as input, enabling us to apply various constraints to the generation at inference time. After detailing our approach, we demonstrate our model across various editing tasks in the domain of multi-instrument MIDI loops. We end by highlighting some limitations of the approach and avenues for future work. We provide examples from the SLM across multiple generation and editing tasks at https://erl-j.github.io/slm-mml-demo/.

Paper Structure

This paper contains 15 sections, 4 equations, 2 figures, and 1 algorithm.

Figures (2)

  • Figure 1: Causal Language Modelling, Masked Language Modelling and Superposed Language Modelling of 4-letter words with a 5-letter vocabulary.
  • Figure 2: Illustration of how priors on an unordered representation of music can be used to steer generation. This example uses a toy representation of music with 4 pitches, 4 onset times and 4 offset times. The upper row shows non-normalized vocabulary priors as binary masks, where the columns represent constraints on pitch, onset and offset respectively. The bottom row illustrates the constraints in piano roll form. Each note event is colour coded. A colour gradient in the piano roll indicates that the note event(s) of the corresponding hue might be present in the region. Panel 1 shows a fully determined musical piece containing 3 notes. Notice how the orange note is inactive, hence all its attributes are set to undefined ("-"). Panel 2 shows a completely unconstrained musical piece where nothing is known about any of the notes. Panel 3 shows a constraint representing a time-pitch infilling task. Notice how the red and orange notes differ in their binary masks. Unlike red, which is guaranteed to be active (all the "-" cells are 0), orange might or might not be active. This allows us to express precise ranges on the number of notes we want. Panel 4 shows how we can use constraints to express tonality and rhythm. Panel 5 shows how vocabulary constraints do not need to be uniform across the note events. Here, we only restrict the blue note's pitch and the orange note's onset.
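
The binary masks described in the Figure 2 caption can be sketched for the same toy representation (4 pitches, 4 onsets, 4 offsets, plus the undefined token "-"). This is an illustrative reconstruction from the caption alone, not the authors' code: the helper names `note_mask` and `active_note_range`, the placement of "-" at index 4, and the specific pitch and onset values chosen below are all assumptions.

```python
import numpy as np

UNDEFINED = 4                       # assumed index of "-"; 0..3 are the toy values
ATTRS = ("pitch", "onset", "offset")

def note_mask(pitch=None, onset=None, offset=None, active="maybe"):
    """Binary mask of shape (3, 5) for one note event.

    pitch/onset/offset: iterable of allowed values in 0..3, or None for "any".
    active: "yes" (the "-" cells are 0), "no" (only "-" allowed) or "maybe".
    """
    mask = np.zeros((len(ATTRS), 5), dtype=int)
    for row, allowed in enumerate((pitch, onset, offset)):
        if active != "no":
            values = range(4) if allowed is None else allowed
            mask[row, list(values)] = 1
        if active != "yes":
            mask[row, UNDEFINED] = 1
    return mask

def active_note_range(masks):
    """Return (min, max) number of active notes the masks permit."""
    must = sum(int(m[0, UNDEFINED] == 0) for m in masks)      # "-" forbidden
    may = sum(int(m[0, :UNDEFINED].any()) for m in masks)     # some value allowed
    return must, may

# A time-pitch infilling constraint in the spirit of panel 3:
blue = note_mask(pitch=[0], onset=[0], offset=[1], active="yes")   # fixed context note
red = note_mask(pitch=[1, 2], onset=[1, 2], active="yes")          # must appear in the region
orange = note_mask(pitch=[1, 2], onset=[1, 2], active="maybe")     # may appear in the region
note_range = active_note_range([blue, red, orange])
```

Dividing each row of a mask by its row sum would give a normalized prior. Keeping the "-" cell open on some notes and closed on others is what lets a constraint express a range on the note count, here between 2 and 3 notes.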