MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition
Philippe Pasquier, Jeff Ens, Nathan Fradet, Paul Triana, Davide Rizzotti, Jean-Baptiste Rolland, Maryam Safi
TL;DR
MIDI-GPT addresses the gap between powerful generative models and practical, controllable music creation. It is a Transformer-based system with a novel multi-track tokenization that supports track- and bar-level infilling and explicit attribute controls. It provides two tokenizations (Multi-Track and Bar-Fill) and adds expressiveness through VELOCITY and DELTA tokens, enabling dynamic and nuanced performances. The model is trained on the GigaMIDI dataset and evaluated for originality, stylistic similarity, and control effectiveness. The results show reduced data duplication with longer generations, stylistic alignment with the training data, and meaningful control over note density, note duration, and polyphony. Collaborations with industry partners and tool integrations demonstrate real-world viability, and the work outlines future directions in real-time generation, larger contexts, and broader applicability.
Abstract
We present and release MIDI-GPT, a generative system based on the Transformer architecture that is designed for computer-assisted music composition workflows. MIDI-GPT supports the infilling of musical material at the track and bar level, and can condition generation on attributes including instrument type, musical style, note density, polyphony level, and note duration. To integrate these features, we employ an alternative representation for musical material: we create a time-ordered sequence of musical events for each track and concatenate the tracks into a single sequence, rather than using a single time-ordered sequence in which the musical events of different tracks are interleaved. We also propose a variation of our representation that allows for expressiveness. We present experimental results demonstrating that MIDI-GPT consistently avoids duplicating the musical material it was trained on, that it generates music stylistically similar to the training dataset, and that its attribute controls can enforce various constraints on the generated material. We also outline several real-world applications of MIDI-GPT, including collaborations with industry partners that explore the integration of MIDI-GPT into commercial products and its evaluation there, as well as several artistic works produced using it.
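The difference between the two orderings described above can be illustrated with a minimal Python sketch. It is not the MIDI-GPT tokenizer: the `Note` type, the helper functions, and the token names (`TRACK_START_*`, `ONSET_*`, `PITCH_*`, `TRACK_END`) are hypothetical and chosen only to show how the same notes are laid out in a track-concatenated sequence versus an interleaved one.

```python
from typing import NamedTuple


class Note(NamedTuple):
    track: int  # track index
    onset: int  # onset time in arbitrary ticks
    pitch: int  # MIDI pitch number


def track_concatenated(notes: list[Note]) -> list[str]:
    """One time-ordered run of events per track, tracks joined end to end.

    Token names are illustrative assumptions, not MIDI-GPT's vocabulary.
    """
    tokens: list[str] = []
    for track in sorted({n.track for n in notes}):
        tokens.append(f"TRACK_START_{track}")
        track_notes = [n for n in notes if n.track == track]
        for n in sorted(track_notes, key=lambda n: (n.onset, n.pitch)):
            tokens += [f"ONSET_{n.onset}", f"PITCH_{n.pitch}"]
        tokens.append("TRACK_END")
    return tokens


def interleaved(notes: list[Note]) -> list[str]:
    """A single time-ordered sequence in which events of all tracks are mixed."""
    tokens: list[str] = []
    for n in sorted(notes, key=lambda n: (n.onset, n.track, n.pitch)):
        tokens += [f"ONSET_{n.onset}", f"TRACK_{n.track}", f"PITCH_{n.pitch}"]
    return tokens


# Two tracks, two notes each.
notes = [Note(0, 0, 60), Note(1, 0, 36), Note(0, 4, 64), Note(1, 4, 43)]
concat_tokens = track_concatenated(notes)
inter_tokens = interleaved(notes)
print(concat_tokens)
print(inter_tokens)
```

In the concatenated form each track occupies one contiguous span of the sequence, so a whole track can be dropped or regenerated without touching the tokens of the others. In the interleaved form the same track is scattered across the sequence, which makes that kind of track-level operation harder to express.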
