A Survey of Music Generation in the Context of Interaction
Ismael Agchar, Ilja Baumann, Franziska Braun, Paula Andrea Perez-Toro, Korbinian Riedhammer, Sebastian Trump, Martin Ullrich
TL;DR
The paper surveys the landscape of music generation in interactive contexts, covering data formats, transformed representations, statistical and parametric modeling, and evaluation. It contrasts symbolic and audio modalities and reviews approaches from Markov models and grammars to recurrent networks, CNNs, GANs, VAEs and Transformers, analyzing their suitability for real‑time co‑creation. It highlights key datasets and architectures (e.g., Maestro, NSynth, MusicVAE, MuseNet, WaveNet, Music Transformer) and discusses evaluation strategies, including the need for interactive and human‑in‑the‑loop assessment. It concludes with perspectives on co‑creative processes, challenges in evaluation, and directions for designing and assessing live human–machine musical collaboration, including projects like Spirio Sessions.
Abstract
In recent years, machine learning, and in particular generative adversarial neural networks (GANs) and attention-based neural networks (transformers), have been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses foremost on style replication (e.g., generating a Bach-style chorale) or style transfer (e.g., classical to jazz) based on large amounts of recorded or transcribed music, which in turn also allows for fairly straightforward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, nor is it clear how such models and the resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.
