Adaptable Symbolic Music Infilling with MIDI-RWKV
Christian Zhou-Zheng, Philippe Pasquier
TL;DR
MIDI-RWKV is a compact symbolic music infilling model based on RWKV-7, designed for controllable, long-context, multi-track workflows in computer-assisted composition. It combines REMI+ encoding, per-bar attribute controls, and a single-section infilling objective with state tuning, a lightweight adaptation method that learns the model's initial hidden states to capture a style from minimal data. In objective and subjective evaluations, MIDI-RWKV matches or surpasses several baselines, and state tuning consistently outperforms LoRA in low-sample regimes, which makes it practical for individual composers working on edge devices. The authors see potential for integration into DAWs and edge deployments, and identify data biases, limited control granularity, and latency as areas for future improvement.
Abstract
Existing work in automatic music generation has mostly focused on end-to-end systems that generate either entire compositions or continuations of pieces, which are difficult for composers to iterate on. The area of computer-assisted composition, where generative models integrate into existing creative workflows, remains comparatively underexplored. In this study, we address two tasks that support computer-assisted composition: model style adaptation, and symbolic music infilling that is multi-track, long-context, and controllable. We present MIDI-RWKV, a small foundation model based on the RWKV-7 linear architecture, to enable efficient and coherent musical co-creation on edge devices. We also demonstrate that MIDI-RWKV admits an effective method of finetuning its initial state for style adaptation in the very-low-sample regime. We evaluate MIDI-RWKV and its state tuning against existing models on several quantitative and qualitative metrics, and release model weights and code at https://github.com/christianazinn/MIDI-RWKV.
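
To make the idea of finetuning only an initial state more concrete, the following is a minimal sketch on a toy diagonal linear recurrence. It is not the RWKV-7 update rule and not the released implementation. The functions `run`, `loss_and_grad`, and `tune_state`, the weights, and the data are all hypothetical and chosen purely for illustration. The point it shows is that the recurrence weights stay frozen while gradient descent updates only the initial hidden state `h0`.

```python
# Toy illustration of state tuning: every "model" weight is frozen and only
# the initial hidden state h0 is trained. This is a hypothetical stand-in
# recurrence, not the RWKV-7 architecture.

def run(h0, xs, decay, w_in, w_out):
    """Run h_t = decay * h_{t-1} + w_in * x_t and return y_t = w_out . h_t."""
    h = list(h0)
    ys = []
    for x in xs:
        h = [d * hi + wi * x for d, hi, wi in zip(decay, h, w_in)]
        ys.append(sum(wo * hi for wo, hi in zip(w_out, h)))
    return ys


def loss_and_grad(h0, xs, targets, decay, w_in, w_out):
    """Squared-error loss and its gradient with respect to h0 only."""
    ys = run(h0, xs, decay, w_in, w_out)
    loss = sum((y - t) ** 2 for y, t in zip(ys, targets))
    grad = [0.0] * len(h0)
    for step, (y, t) in enumerate(zip(ys, targets), start=1):
        for i in range(len(h0)):
            # For this recurrence, d y_step / d h0[i] = w_out[i] * decay[i] ** step.
            grad[i] += 2.0 * (y - t) * w_out[i] * decay[i] ** step
    return loss, grad


def tune_state(xs, targets, decay, w_in, w_out, lr=0.05, steps=200):
    """Plain gradient descent on h0; decay, w_in, and w_out are never updated."""
    h0 = [0.0] * len(decay)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(h0, xs, targets, decay, w_in, w_out)
        history.append(loss)
        h0 = [hi - lr * g for hi, g in zip(h0, grad)]
    return h0, history


# Frozen toy weights (tuples, so they cannot be modified in place).
decay = (0.9, 0.5)
w_in = (1.0, 0.5)
w_out = (1.0, -1.0)

# A short input sequence and targets produced by a hidden "style" state.
xs = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
style_state = [0.8, -0.3]
targets = run(style_state, xs, decay, w_in, w_out)

tuned_h0, history = tune_state(xs, targets, decay, w_in, w_out)
print(f"loss before tuning: {history[0]:.4f}, after: {history[-1]:.6f}")
```

In this sketch the number of trainable values equals the size of the state, which is why such an approach can be attractive when only a handful of examples is available. How the actual model parameterizes and trains its initial state is described in the paper and the linked repository.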
