MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation

Shih-Lun Wu, Yoon Kim, Cheng-Zhi Anna Huang

TL;DR

MIDI-LLM adapts a pretrained text LLM to text-to-MIDI generation by expanding its vocabulary with MIDI tokens from the Anticipatory Music Transformer (AMT) and training in two stages: continued pretraining followed by supervised finetuning. Because the approach preserves the original model's parameter structure, inference can be accelerated directly with vLLM. Compared with the recent Text2midi baseline, MIDI-LLM achieves higher quality, better text control, and faster generation. The main contributions are the integration of AMT-based MIDI tokenization, the two-stage training recipe, and comparisons that show practical benefits for editable, multitrack MIDI workflows. The work suggests that existing LLM ecosystems can be reused for symbolic music generation and points to future directions in interactive editing and user-specific customization.

Abstract

We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities. By preserving the original LLM's parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live demo at https://midi-llm-demo.vercel.app.

Paper Structure

This paper contains 14 sections, 1 equation, 1 figure, 2 tables.

Figures (1)

  • Figure 1: MIDI-LLM recipe overview. We initialize MIDI-LLM by expanding the token embeddings of the Llama 3.2 1B LLM (Grattafiori et al., 2024) with the MIDI vocabulary defined in the Anticipatory Music Transformer (AMT; Thickstun et al., 2024). We then train the full model in two stages to achieve text-to-MIDI generation. See the training data table for more information on our training data.
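
The vocabulary expansion described in the caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that MIDI tokens are appended after the text vocabulary, so a MIDI token index maps to an LLM token ID by a fixed offset, and that new embedding rows start near the mean of the existing rows, a common heuristic that the source does not confirm. The function names and the toy sizes are hypothetical and chosen only for illustration.

```python
import random


def expand_embeddings(embeddings, num_new_tokens, seed=0):
    """Return a copy of the embedding table with num_new_tokens rows appended.

    Assumption: new rows are set to the mean of the existing rows plus small
    Gaussian noise. The source does not state how new rows are initialized.
    """
    rng = random.Random(seed)
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    new_rows = [
        [m + rng.gauss(0.0, 0.01) for m in mean] for _ in range(num_new_tokens)
    ]
    # Existing rows are copied unchanged, so text token IDs keep their meaning.
    return [list(row) for row in embeddings] + new_rows


def midi_to_llm_id(midi_token, text_vocab_size):
    """Map a MIDI token index to its ID in the expanded vocabulary."""
    return text_vocab_size + midi_token


def llm_to_midi_id(llm_id, text_vocab_size):
    """Map an expanded-vocabulary ID back to a MIDI token index."""
    if llm_id < text_vocab_size:
        raise ValueError("ID belongs to the text vocabulary, not the MIDI one")
    return llm_id - text_vocab_size


# Toy sizes for illustration only; real vocabularies are far larger.
TEXT_VOCAB = 8
MIDI_VOCAB = 4
DIM = 3

text_emb = [[float(i + j) for j in range(DIM)] for i in range(TEXT_VOCAB)]
expanded = expand_embeddings(text_emb, MIDI_VOCAB)
print(len(expanded), midi_to_llm_id(0, TEXT_VOCAB))
```

Appending rows rather than replacing any leaves the pretrained text embeddings and the rest of the network untouched, which is consistent with the abstract's point that the original parameter structure is preserved.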