Anticipatory Music Transformer

John Thickstun; David Hall; Chris Donahue; Percy Liang

TL;DR

The paper introduces the Anticipatory Music Transformer, a framework for controllable generation of temporal point processes that interleaves events with asynchronous controls, placing each control at a position determined by a stopping time. It leverages arrival-time encoding and a fixed context length to enable anticipatory inference and infilling, and is trained on the Lakh MIDI dataset with extensive data augmentation. Empirical results show that anticipatory infilling matches autoregressive performance while enabling new control tasks such as accompaniment, with human evaluators often preferring anticipatory outputs for both continuation and accompaniment. The approach offers a generalizable, locality-friendly method for controllable symbolic music generation and potentially for other temporal domains that require asynchronous conditioning.

Abstract

We introduce anticipation: a method for constructing a controllable generative model of a temporal point process (the event process) conditioned asynchronously on realizations of a second, correlated process (the control process). We achieve this by interleaving sequences of events and controls, such that controls appear following stopping times in the event sequence. This work is motivated by problems arising in the control of symbolic music generation. We focus on infilling control tasks, whereby the controls are a subset of the events themselves, and conditional generation completes a sequence of events given the fixed control events. We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset. These models match the performance of autoregressive models for prompted music generation, with the additional capability to perform infilling control tasks, including accompaniment. Human evaluators report that an anticipatory model produces accompaniments with similar musicality to even music composed by humans over a 20-second clip.

Paper Structure (42 sections, 17 equations, 7 figures, 11 tables, 2 algorithms)

Introduction
Contributions.
Music as a Temporal Point Process
Modeling Temporal Point Processes.
Modeling Arrival Times.
Encoding Music as Sequences
Anticipation
Stopping Times
Sparse Sequences
Training Anticipatory Models
Anticipatory Inference
Anticipatory Infilling Models
Anticipatory Infilling Models of Music
Automatic Metrics
Human Evaluation
...and 27 more sections

Figures (7)

Figure 1: We construct generative models for sequences of events $\mathbf{e}_{1:N}$, conditioned on controls $\mathbf{u}_{1:K}$. We serialize these paired sequences to define an autoregressive factorization of the joint distribution over events and controls. Anticipation interleaves event and control sequences so that a control $\mathbf{u}_k$ on time $s_k$ appears in the recent history when predicting events near time $s_k$. An anticipated control $\mathbf{u}_k$ on time $s_k$ appears as if it were at approximately time $s_k' = s_k-\delta$. For example, when predicting $\mathbf{e}_{j+7}$ above, the recent context of the anticipation sequence contains past events and controls, as well as the future control $\mathbf{u}_{k+5}$; we say that a model predicting $\mathbf{e}_{j+7}$ given this context $\emph{anticipates}$ the control $\mathbf{u}_{k+5}$, approximately $\delta$ seconds in advance. Crucially, to be able to condition on controls, the index that immediately precedes each control in the serialized sequence must be a stopping time, a property that naively interleaving events and controls using the sort order of times $s_k'$ does not satisfy.
Figure 2: The distribution of sequence lengths calculated for the arrival-time tokenized Lakh MIDI validation split. Mean sequence length is 12071 tokens, with a standard deviation of 9711 tokens.
Figure 3: The distribution of instantaneous tokens/second calculated for the arrival-time tokenized Lakh MIDI validation split. Mean instantaneous tokens/second for the Lakh MIDI dataset is 68 with a standard deviation of 51 tokens/second.
Figure 4: The interface used by evaluators to assess the relative musicality of paired music clips.
Figure 5: Visualizations of 20-second music clips. Each rectangle indicates a musical event with an onset time, duration (width), and pitch (height). Colors indicate distinct instrumental parts. For the accompaniment task, events in the blue instrumental part are used as control events. Top: a five-second prompt followed by the original continuation of only the melodic instrumental line (highest; blue). Middle: the five-second prompt followed by a generated autoregressive continuation, ignoring the original melodic line. Bottom: the prompt followed by a generated anticipatory accompaniment of the original melodic instrumental line.
...and 2 more figures
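
The interleaving described in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes one particular stopping-time rule that is consistent with the caption: each control with time $s_k$ is placed immediately after the first event whose time is at least $s_k - \delta$. Deciding whether an event triggers a control then depends only on events seen so far, which is what makes the preceding index a stopping time; sorting by $s_k'$ alone would instead require knowing the time of the next event. The function name `interleave`, the tuple representation, and the handling of leftover controls are hypothetical choices for illustration, not necessarily the paper's exact algorithm.

```python
from typing import List, Tuple

Item = Tuple[float, str]  # (time in seconds, label); hypothetical representation


def interleave(events: List[Item], controls: List[Item], delta: float) -> List[Tuple[str, float, str]]:
    """Interleave time-sorted events and controls.

    Assumed rule: a control at time s is emitted right after the first
    event whose time t satisfies t >= s - delta.
    """
    out: List[Tuple[str, float, str]] = []
    k = 0  # index of the next control that has not been placed yet
    for t, label in events:
        out.append(("event", t, label))
        # The decision to place controls here uses only events up to time t.
        while k < len(controls) and controls[k][0] - delta <= t:
            s, c_label = controls[k]
            out.append(("control", s, c_label))
            k += 1
    # Assumption: controls never reached by any event are appended at the end.
    for s, c_label in controls[k:]:
        out.append(("control", s, c_label))
    return out


events = [(0.0, "e1"), (1.0, "e2"), (6.0, "e3")]
controls = [(5.5, "u1"), (7.0, "u2")]
sequence = interleave(events, controls, delta=5.0)
print([label for _, _, label in sequence])  # ['e1', 'e2', 'u1', 'e3', 'u2']
```

In this toy run, the control `u1` at 5.5 seconds appears right after `e2` at 1.0 seconds, so a model predicting `e3` would see `u1` roughly $\delta$ seconds ahead of its actual time.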

Theorems & Definitions (13)

Definition 2.1
Definition 2.2
Definition 2.3
Definition 3.1
Example 3.2
Definition 3.3
Example 3.4
Definition 3.5
Example 3.6
Definition C.1
...and 3 more
