Flexible Control in Symbolic Music Generation via Musical Metadata

Sangjun Han, Jiwon Ham, Chaeeun Lee, Heejin Kim, Soojong Do, Sihyuk Yi, Jun Seo, Seoyoon Kim, Yountae Jung, Woohyung Lim

TL;DR

This work addresses flexible, controllable symbolic music generation by conditioning a decoder-only autoregressive Transformer on musical metadata to produce 4-bar multitrack MIDI. The key novelty is randomly dropping conditioning tokens during training, so that the model accepts partial inputs: users can supply any subset of the controls (an or-relationship among them) rather than all of them at once. The approach uses REMI+ representations over diverse MIDI datasets, balances musical fidelity against controllability, and is validated with quantitative metrics and a human listening test that places its output near ground-truth quality. The demonstrated system supports interactive composition of motifs and themes, is illustrated in a publicly available demonstration video, and leaves longer-term and finer-grained control to future work.

Abstract

In this work, we present a demonstration of symbolic music generation, focusing on providing short musical motifs that serve as the central theme of the narrative. For the generation, we adopt an autoregressive model that takes musical metadata as input and generates 4 bars of multitrack MIDI sequences. During training, we randomly drop tokens from the musical metadata to guarantee flexible control. This strategy gives users the freedom to select input types while maintaining generative performance, enabling greater flexibility in music composition. We validate the effectiveness of the strategy through experiments in terms of model capacity, musical fidelity, diversity, and controllability. Additionally, we scale up the model and compare it with another music generation model through a subjective test. Our results indicate its superiority in both control and music quality. Our demonstration video is available at https://www.youtube.com/watch?v=-0drPrFJdMQ.
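
The random-drop idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the token names, the independent per-token drop probability `drop_prob`, the choice to remove dropped tokens outright rather than replace them with a placeholder, and the helper names `drop_metadata` and `build_training_sequence`.

```python
import random


def drop_metadata(metadata, drop_prob=0.5, rng=None):
    """Return the metadata tokens that survive a random drop.

    Each token is removed independently with probability `drop_prob`
    (an assumed scheme; the paper's exact procedure may differ).
    The order of the surviving tokens is preserved.
    """
    rng = rng or random.Random()
    return [tok for tok in metadata if rng.random() >= drop_prob]


def build_training_sequence(metadata, music, drop_prob=0.5, rng=None):
    """Prefix the music tokens with a randomly thinned metadata prefix.

    An autoregressive model trained on such sequences sees many different
    subsets of the conditioning tokens, so at inference time it can be
    prompted with whichever subset the user chooses to provide.
    """
    return drop_metadata(metadata, drop_prob, rng) + list(music)


# Hypothetical token vocabularies, for illustration only.
metadata = ["Tempo_120", "Key_Cmaj", "Instrument_Piano", "Bars_4"]
music = ["Bar_1", "Position_0", "Pitch_60", "Duration_4"]

# A fixed seed makes the example reproducible.
example = build_training_sequence(metadata, music, rng=random.Random(0))
print(example)
```

With `drop_prob=0.0` the full metadata prefix is kept, and with `drop_prob=1.0` the sequence contains only music tokens, which corresponds to unconditional generation.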

Paper Structure (10 sections, 3 figures, 1 table)

This paper contains 10 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: The user interface of our demonstration.
  • Figure 2: Upper: The original next token predictions, Bottom: The next token predictions with random drop conditions. The dotted boxes indicate that tokens are dropped.
  • Figure 3: Win rates of our generated samples compared to the ground truth (GT) and FIGARO. The solid line indicates the standard deviation.