MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation
Shih-Lun Wu, Yoon Kim, Cheng-Zhi Anna Huang
TL;DR
MIDI-LLM adapts a pretrained LLM for text-to-MIDI generation by expanding its vocabulary with AMT MIDI tokens and training in two stages. The approach preserves the original model's parameter structure, which enables accelerated inference with vLLM, and it delivers higher quality, better text control, and faster generation than the recent Text2midi baseline. Key contributions include the integration of AMT-based MIDI tokenization, a two-stage training recipe (continued pretraining followed by supervised finetuning), and comparisons showing practical benefits for editable, multitrack MIDI workflows. The work highlights the viability of leveraging LLM ecosystems for symbolic music generation and points to future directions in interactive editing and user-specific customization.
Abstract
We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow it with text-to-MIDI abilities. By preserving the original LLM's parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference than the recent Text2midi model. Live demo at https://midi-llm-demo.vercel.app.
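
As a rough illustration of what expanding a text LLM's vocabulary with MIDI tokens can look like, the toy sketch below appends new rows to an embedding table and maps MIDI token ids into the id range above the text vocabulary. It is not the authors' implementation: the function names `expand_vocab` and `midi_to_model_id`, the toy sizes, and the mean-plus-noise initialization of the new rows are all assumptions made for illustration.

```python
import random


def expand_vocab(embeddings, num_new_tokens, seed=0):
    """Return a new embedding table with rows appended for new tokens.

    `embeddings` is a list of equal-length float lists (one row per token).
    New rows start at the mean of the existing rows plus small Gaussian
    noise. This initialization is an assumption, not the paper's recipe.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    new_rows = [[m + rng.gauss(0.0, 0.01) for m in mean] for _ in range(num_new_tokens)]
    # Existing rows are kept unchanged, so the text model's parameters are preserved.
    return embeddings + new_rows


def midi_to_model_id(midi_token_id, text_vocab_size):
    """Map a MIDI token id to its id in the expanded vocabulary (hypothetical layout)."""
    return text_vocab_size + midi_token_id


# Toy sizes chosen for readability; real vocabularies are far larger.
text_vocab_size = 8
dim = 4
num_midi_tokens = 5

text_emb = [[float(i + j) for j in range(dim)] for i in range(text_vocab_size)]
expanded = expand_vocab(text_emb, num_midi_tokens)
```

Because the original rows are untouched and only new rows are appended, the expanded table keeps the same layout as a standard LLM embedding matrix, which is the kind of structural compatibility that lets an off-the-shelf inference engine be reused.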
