Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization

Longshen Ou, Jingwei Zhao, Ziyu Wang, Gus Xia, Qihao Liang, Torin Hopkins, Ye Wang

TL;DR

This work tackles the challenge of unifying symbolic music arrangement across reinterpretation, simplification, and additive generation by fine-tuning a single pre-trained symbolic model with a segment-level reconstruction objective over disentangled content and style. It introduces REMI-z, a track-wise continuity tokenization that reduces fragmentation and sequence length, facilitating better instrument-level control and modeling efficiency. Across band arrangement, piano reduction, and drum arrangement, the approach outperforms task-specific baselines in both objective metrics and human judgments, and pre-training consistently enhances performance. The results suggest broad applicability for symbolic music-to-music transformation and demonstrate practical benefits for unconditional modeling through more compact representations and lower note-level perplexity.

Abstract

We present a unified framework for automatic multitrack music arrangement that enables a single pre-trained symbolic music model to handle diverse arrangement scenarios, including reinterpretation, simplification, and additive generation. At its core is a segment-level reconstruction objective operating on token-level disentangled content and style, allowing for flexible any-to-any instrumentation transformations at inference time. To support track-wise modeling, we introduce REMI-z, a structured tokenization scheme for multitrack symbolic music that enhances modeling efficiency and effectiveness for both arrangement tasks and unconditional generation. Our method outperforms task-specific state-of-the-art models on representative tasks in different arrangement scenarios (band arrangement, piano reduction, and drum arrangement), in both objective metrics and perceptual evaluations. Taken together, our framework demonstrates strong generality and suggests broader applicability in symbolic music-to-music transformation.

Paper Structure (48 sections, 12 equations, 6 figures, 9 tables)

This paper contains 48 sections, 12 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of the proposed unified framework for each arrangement task. The symbol $\bigoplus$ denotes concatenation of component sequences. Music segments are decomposed into three subsequences: instruments, content, and target-side history. These components form the condition sequence, with the relevant tracks from the original music as the target sequence. The model is trained to reconstruct the music from these components.
  • Figure 2: An example of the tokenized sequence for the band arrangement task. Special tokens $\texttt{<SEP>}$, $\texttt{<INSTRUMENT>}$, $\texttt{<CONTENT>}$, and $\texttt{<HISTORY>}$ are used to separate different components. Tokens starting with o-, i-, p-, and d- represent the onset, instrument ID, pitch, and duration of notes, respectively.
  • Figure 3: REMI+ and REMI-z tokenization for the same bar. Contents of the same instrument are highlighted with the same color. See Appendix \ref{app:remi_z} for the complete vocabulary.
  • Figure 4: An example of REMI-z tokenization.
  • Figure 5: The content sequence obtained by applying the operator $\mathrm{C}(\cdot)$ to the REMI-z sequence shown in Figure \ref{fig:remiz_example}.
  • ...and 1 more figure
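
The captions for Figures 1 and 2 describe how a training example is assembled: a condition sequence built from instruments, content, and target-side history, paired with a target sequence holding the relevant tracks. The Python sketch below illustrates that idea under several assumptions that are not taken from the paper: the `Note` type, the function names, the ordering of the special tokens, the placement of `<SEP>` at the end of the condition, and the reading of the content view as "notes with instrument tokens removed". Only the special-token names and the o-, i-, p-, d- prefixes come from the captions. It should be read as a rough illustration, not as the authors' implementation.

```python
from typing import List, NamedTuple, Tuple


class Note(NamedTuple):
    """Hypothetical note record; field names are illustrative only."""
    onset: int       # onset position within the segment
    instrument: int  # instrument ID
    pitch: int
    duration: int


def _note_key(n: Note) -> Tuple[int, int]:
    # Assumed ordering: by onset, then from high pitch to low pitch.
    return (n.onset, -n.pitch)


def track_tokens(notes: List[Note]) -> List[str]:
    """Track-wise layout: one instrument token, then all notes of that track."""
    tokens: List[str] = []
    for inst in sorted({n.instrument for n in notes}):
        tokens.append(f"i-{inst}")
        for n in sorted((m for m in notes if m.instrument == inst), key=_note_key):
            tokens += [f"o-{n.onset}", f"p-{n.pitch}", f"d-{n.duration}"]
    return tokens


def content_tokens(notes: List[Note]) -> List[str]:
    """Assumed content view: all notes merged, instrument identity dropped."""
    tokens: List[str] = []
    for n in sorted(notes, key=_note_key):
        tokens += [f"o-{n.onset}", f"p-{n.pitch}", f"d-{n.duration}"]
    return tokens


def build_example(
    segment: List[Note],
    history: List[Note],
    target_instruments: List[int],
) -> Tuple[List[str], List[str]]:
    """Concatenate instruments, content, and target-side history into a
    condition sequence; the target is the relevant tracks of the segment."""
    wanted = set(target_instruments)
    condition = (
        ["<INSTRUMENT>"] + [f"i-{i}" for i in target_instruments]
        + ["<CONTENT>"] + content_tokens(segment)
        + ["<HISTORY>"] + track_tokens([n for n in history if n.instrument in wanted])
        + ["<SEP>"]  # assumed boundary between condition and target
    )
    target = track_tokens([n for n in segment if n.instrument in wanted])
    return condition, target


if __name__ == "__main__":
    segment = [Note(0, 0, 60, 4), Note(0, 33, 36, 8), Note(4, 0, 64, 4)]
    condition, target = build_example(segment, history=[], target_instruments=[0])
    print(" ".join(condition))
    print(" ".join(target))
```

In this sketch, changing `target_instruments` at inference time is what would request a different instrumentation for the same content, which is the intuition behind the any-to-any transformation described in the abstract.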