Adaptable Symbolic Music Infilling with MIDI-RWKV

Christian Zhou-Zheng, Philippe Pasquier

TL;DR

MIDI-RWKV introduces a compact, RWKV-7–based symbolic music infilling model tailored for controllable, long-context, multi-track workflows in computer-assisted composition. It combines REMI+ encoding, per-bar attribute controls, and a single-section infilling objective with a lightweight state-tuning adaptation method that modulates initial hidden states to capture style with minimal data. Objective and subjective evaluations show MIDI-RWKV matches or surpasses several baselines, and state tuning consistently outperforms LoRA in low-sample regimes, highlighting practical relevance for individual composers using edge devices. The work demonstrates significant potential for integration into DAWs and edge deployments, while acknowledging data biases, limited control granularity, and latency as areas for future improvement.

Abstract

Existing work in automatic music generation has mostly focused on end-to-end systems that generate either entire compositions or continuations of pieces, which are difficult for composers to iterate on. The area of computer-assisted composition, where generative models integrate into existing creative workflows, remains comparatively underexplored. In this study, we address the tasks of model style adaptation and multi-track, long-context, and controllable symbolic music infilling to enhance the process of computer-assisted composition. We present MIDI-RWKV, a small foundation model based on the RWKV-7 linear architecture, to enable efficient and coherent musical cocreation on edge devices. We also demonstrate that MIDI-RWKV admits an effective method of finetuning its initial state for style adaptation in the very-low-sample regime. We evaluate MIDI-RWKV and its state tuning on several quantitative and qualitative metrics with respect to existing models, and release model weights and code at https://github.com/christianazinn/MIDI-RWKV.

Paper Structure

This paper contains 55 sections, 4 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Comparison of one track of sheet music (above) with REMI+ (below). REMI tokens have black text, tokens unique to REMI+ have white text. TrackEnd is omitted because the track continues.
  • Figure 2: Our single-section infilling representation. A contiguous set of bars is masked and its content is moved to the end of the sequence. <CONTROL> represents a control token sequence.
  • Figure 3: Example training samples for single-section infilling (above) and arbitrary masking pattern infilling (below). Measures to infill (masked) are outlined in red, the full context window in green.
  • Figure 4: Evaluation of attribute control effectiveness. Left: Average absolute difference between real and intended note density. Right: Success rate of categorical control tokens.
  • Figure 5: Visualizations of hidden state dynamics with state tuning on $N=2,C=8$ objectives. Top: Frobenius distance between state-tuned and base model states at each timestep, plus baseline of the base model's state distance to its previous state. Middle left: Violin plot of state magnitudes across the hidden state samples taken. Middle right: Layer-wise Frobenius distance between state-tuned and base model states. Bottom: Three 2-dimensional principal component analyses of each model's hidden states along the first six principal components. State v2 model data points are obscured by overlapping v3 data points in the bottom left, and likewise for base model data points by v1 data points in the bottom center.
  • ...and 8 more figures
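
The single-section infilling representation described in the Figure 2 caption (a contiguous set of bars is masked and its content is moved to the end of the sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `to_infilling_sequence`, the placeholder tokens `<MASK>`, `<INFILL_START>`, and `<INFILL_END>`, and the placement of the control tokens directly before the moved content are all assumptions made for illustration only.

```python
from typing import List, Sequence

# Hypothetical placeholder tokens; the real vocabulary is not given in this text.
MASK = "<MASK>"
INFILL_START = "<INFILL_START>"
INFILL_END = "<INFILL_END>"


def to_infilling_sequence(
    bars: Sequence[Sequence[str]],
    start: int,
    end: int,
    control: Sequence[str] = (),
) -> List[str]:
    """Mask the contiguous bars[start:end] and move their tokens to the end.

    bars    -- one token list per bar, in score order
    start   -- index of the first masked bar (inclusive)
    end     -- index one past the last masked bar (exclusive)
    control -- control token sequence; its position here is an assumption
    """
    if not 0 <= start < end <= len(bars):
        raise ValueError("start/end must select a non-empty range of bars")

    seq: List[str] = []
    # Context before the masked section is kept in place.
    for bar in bars[:start]:
        seq.extend(bar)
    # A single placeholder stands in for the whole masked section.
    seq.append(MASK)
    # Context after the masked section is kept in place.
    for bar in bars[end:]:
        seq.extend(bar)
    # The masked content is appended at the end, so a left-to-right model
    # sees both past and future context before generating it.
    seq.append(INFILL_START)
    seq.extend(control)
    for bar in bars[start:end]:
        seq.extend(bar)
    seq.append(INFILL_END)
    return seq


if __name__ == "__main__":
    toy_bars = [["Bar", "a"], ["Bar", "b"], ["Bar", "c"], ["Bar", "d"]]
    print(to_infilling_sequence(toy_bars, 1, 3, ["<CONTROL>"]))
```

In this sketch, everything up to and including the control tokens would serve as the prompt at inference time, and the model would generate the tokens that follow until the end marker.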