A Survey of Music Generation in the Context of Interaction

Ismael Agchar; Ilja Baumann; Franziska Braun; Paula Andrea Perez-Toro; Korbinian Riedhammer; Sebastian Trump; Martin Ullrich

TL;DR

The paper surveys the landscape of music generation in interactive contexts, covering data formats, transformed representations, statistical and parametric modeling, and evaluation. It contrasts symbolic and audio modalities and reviews approaches from Markov models and grammars to recurrent networks, CNNs, GANs, VAEs and Transformers, analyzing their suitability for real‑time co‑creation. It highlights key datasets and architectures (e.g., Maestro, NSynth, MusicVAE, MuseNet, WaveNet, Music Transformer) and discusses evaluation strategies, including the need for interactive and human‑in‑the‑loop assessment. It concludes with perspectives on co‑creative processes, challenges in evaluation, and directions for designing and assessing live human–machine musical collaboration, including projects like Spirio Sessions.

Abstract

In recent years, machine learning, and in particular generative adversarial neural networks (GANs) and attention-based neural networks (transformers), has been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses primarily on style replication (e.g., generating a Bach-style chorale) or style transfer (e.g., classical to jazz) based on large amounts of recorded or transcribed music, which in turn allows for fairly straightforward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, nor is it clear how such models and the resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.

Paper Structure (36 sections, 9 equations, 8 figures, 2 tables)

Introduction
Data and Formats
Formats
Symbolic
Digital Audio
Automated Transcription
Datasets
Transformed Representations
Spectrogram
Mel Spectrogram
Chromagram
Data-driven Features, Embeddings
Statistical Modeling
Markov Chains
Sampling
...and 21 more sections

Figures (8)

Figure 1: Log-spectrogram representation of Beethoven's piano composition "Moonlight Sonata"
Figure 2: Log-mel spectrogram representation of Beethoven's piano composition "Moonlight Sonata"
Figure 3: Chromagram representation of Beethoven's piano composition "Moonlight Sonata"
Figure 4: Simplified representation of the proposed HMM layouts: note-duration layout (1), duration-note layout (2), unassigned-joint layout (3) and velocity-joint layout (4). Background HMM from Sou10.
Figure 5: A construction tree for the sentence "the dog chases the boy".
...and 3 more figures
