Steer-by-prior Editing of Symbolic Music Loops
Nicolas Jonason, Luca Casini, Bob L. T. Sturm
TL;DR
The paper tackles controllable symbolic music loop editing by introducing Superposed Language Modelling (SLM), a generalization of Masked Language Modelling that uses priors $\boldsymbol{\pi}\in[0,1]^{T\times|V|}$ over sequences, with $\sum_v \pi_{t,v}=1$ and $\pi_{t,v}>0$ when $x_t=v$, to enable inference-time constraints. It trains a bi-directional Transformer on a permutation-invariant 4-bar MIDI loop representation with up to $N=300$ notes and $A=9$ attributes, employing a Random-add superposition scheme to generate priors and steer generation. Through inference-time sampling from priors, SLM can steer attributes such as pitch, onset, and rhythm while preserving other content, demonstrated on a diverse set of editing tasks. Limitations include representation-dependent control granularity, the need for musical/programming knowledge to craft priors, and relatively slow sampling (around $7$ seconds for a 4-bar loop); future work targets rigorous evaluation, speedups, longer contexts, and natural-language interfaces for usability.
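As a minimal sketch of the prior constraints stated above (each row of $\boldsymbol{\pi}$ sums to 1 and gives non-zero mass to the true token), the following NumPy snippet builds priors for three cases: a known token (one-hot), a fully masked token (uniform over the vocabulary, which is the Masked Language Modelling special case), and a superposed token (mass spread over a subset of the vocabulary). The helper name `prior_from_allowed`, the toy vocabulary size, the token ids, and the choice to spread mass uniformly over each allowed set are assumptions made for illustration. This is not the authors' implementation, and it does not reproduce the Random-add scheme.

```python
import numpy as np


def prior_from_allowed(allowed_sets, vocab_size):
    """Build a row-stochastic prior of shape (T, |V|).

    allowed_sets[t] holds the token ids permitted at position t.
    Mass is spread uniformly over each allowed set (an illustrative
    assumption, not something specified by the paper).
    """
    pi = np.zeros((len(allowed_sets), vocab_size))
    for t, allowed in enumerate(allowed_sets):
        idx = sorted(allowed)
        if not idx:
            raise ValueError(f"position {t} has no allowed tokens")
        pi[t, idx] = 1.0 / len(idx)
    return pi


V = 5          # hypothetical toy vocabulary size
x = [2, 0, 4]  # hypothetical ground-truth token ids

allowed = [
    {x[0]},          # known token: one-hot prior
    set(range(V)),   # masked token: uniform prior (the MLM case)
    {x[2], 1},       # superposed token: true token plus one alternative
]
pi = prior_from_allowed(allowed, V)

# Check the two constraints from the text.
rows_sum_to_one = np.allclose(pi.sum(axis=1), 1.0)
truth_has_mass = all(pi[t, x[t]] > 0 for t in range(len(x)))
```

Under these assumptions, narrowing an allowed set at inference time (for example, to a subset of pitches) is what constrains the output for that position, while one-hot rows keep the corresponding content fixed.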
Abstract
With the goal of building a system capable of controllable symbolic music loop generation and editing, this paper explores a generalisation of Masked Language Modelling that we call Superposed Language Modelling. Rather than treating each input token as either known or unknown, a Superposed Language Model (SLM) takes priors over the sequence as input, enabling us to apply various constraints to the generation at inference time. After detailing our approach, we demonstrate our model across various editing tasks in the domain of multi-instrument MIDI loops. We end by highlighting some limitations of the approach and avenues for future work. We provide examples from the SLM across multiple generation and editing tasks at https://erl-j.github.io/slm-mml-demo/.
