Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation

Zhiwei Lin; Jun Chen; Boshi Tang; Binzhu Sha; Jing Yang; Yaolong Ju; Fan Fan; Shiyin Kang; Zhiyong Wu; Helen Meng

TL;DR

The paper addresses the problem of modeling and generating long multi-track symbolic music with VAEs, where prior approaches struggle with long sequences and coherence. It introduces Multi-view MidiVAE, which combines a 2-D OctupleMIDI representation with dual transformer-based views (Track-view and Bar-view) and an adaptive fusion mechanism to learn a shared latent space for global harmony and local detail. The objective combines reconstruction losses for the full sequence and its projections plus a KL regularization, expressed as $L_{\text{total}} = L_{\text{rs}} + L_{\text{rst}} + L_{\text{rsb}} + \beta L_{\text{KL}}$. On CocoChorales, the approach yields significant improvements in reconstruction accuracy and listener MOS compared with baselines such as REMI+ and single-view variants, demonstrating enhanced long-range modeling for multi-instrument symbolic music. Overall, the method provides a scalable, effective VAE framework for coherent long-range symbolic music generation and editing.
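The composition of the objective above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, and reading $L_{\text{rs}}$ as the fused reconstruction term with $L_{\text{rst}}$ and $L_{\text{rsb}}$ as the Track-view and Bar-view terms is an assumption drawn from the subscripts. The KL term uses the standard closed form for a diagonal Gaussian posterior against a standard normal prior, which is the usual VAE choice but is not confirmed by this summary.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions.

    Assumption: a diagonal Gaussian posterior, as in a standard VAE.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def total_loss(l_rs, l_rst, l_rsb, l_kl, beta):
    """L_total = L_rs + L_rst + L_rsb + beta * L_KL.

    l_rs, l_rst, l_rsb are reconstruction losses (hypothetical inputs here);
    beta weights the KL regularizer.
    """
    return l_rs + l_rst + l_rsb + beta * l_kl

# Illustrative values only; these numbers are not results from the paper.
l_kl = kl_to_standard_normal(mu=[0.0, 0.0], log_var=[0.0, 0.0])
loss = total_loss(l_rs=1.0, l_rst=2.0, l_rsb=3.0, l_kl=l_kl, beta=0.5)
```

When the posterior equals the prior the KL term vanishes, so the total reduces to the sum of the three reconstruction terms; a larger `beta` pulls the latent space toward the prior, generally at the cost of reconstruction fidelity.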

Abstract

Variational Autoencoders (VAEs) are a crucial component of neural symbolic music generation, and several VAE-based works have produced outstanding results and attracted considerable attention. Nevertheless, previous VAEs still suffer from overly long feature sequences, and their generated results lack contextual coherence, so the challenge of modeling long multi-track symbolic music remains unaddressed. To this end, we propose Multi-view MidiVAE, one of the first VAE methods to effectively model and generate long multi-track symbolic music. Multi-view MidiVAE uses the two-dimensional (2-D) representation OctupleMIDI to capture relationships among notes while reducing the feature sequence length. Moreover, we attend to instrumental characteristics and harmony, as well as global and local information about the musical composition, by employing a hybrid variational encoding-decoding strategy that integrates both Track- and Bar-view MidiVAE features. Objective and subjective experimental results on the CocoChorales dataset demonstrate that, compared to the baseline, Multi-view MidiVAE significantly improves the modeling of long multi-track symbolic music.
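To make the 2-D representation and the two views concrete, here is a small sketch. It assumes the eight-attribute note tuple of OctupleMIDI as originally introduced for MusicBERT (time signature, tempo, bar, position, instrument, pitch, duration, velocity); this page itself only confirms the duration and instrument fields. The `group_by` helper and the toy notes are hypothetical, and the grouping only illustrates what "Track-view" and "Bar-view" organize, not how the paper's encoders are built.

```python
from collections import defaultdict
from typing import NamedTuple

class OctupleNote(NamedTuple):
    """One note as a single 8-attribute row (assumed OctupleMIDI layout)."""
    time_sig: str
    tempo: int
    bar: int
    position: int
    instrument: int
    pitch: int
    duration: int
    velocity: int

# Toy two-instrument, two-bar excerpt (made-up values).
notes = [
    OctupleNote("4/4", 120, 0, 0, 0, 60, 4, 80),
    OctupleNote("4/4", 120, 0, 0, 1, 48, 8, 70),
    OctupleNote("4/4", 120, 1, 0, 0, 62, 4, 80),
    OctupleNote("4/4", 120, 1, 0, 1, 43, 8, 70),
]

def group_by(rows, field):
    """Group note rows by one attribute, preserving their original order."""
    groups = defaultdict(list)
    for row in rows:
        groups[getattr(row, field)].append(row)
    return dict(groups)

track_view = group_by(notes, "instrument")  # one note sequence per instrument
bar_view = group_by(notes, "bar")           # one note sequence per bar

# A 2-D layout has one row per note; spelling out every attribute as its
# own token in a 1-D sequence would be 8 times longer.
rows_2d = len(notes)
tokens_1d = len(notes) * len(OctupleNote._fields)
```

Here the sequence axis has 4 rows instead of 32 flattened tokens, which is the length reduction the abstract refers to. The track grouping keeps each instrument's line intact, while the bar grouping keeps together the notes that sound in the same bar across instruments.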

Paper Structure (14 sections, 3 equations, 3 figures, 2 tables)

This paper contains 14 sections, 3 equations, 3 figures, 2 tables.

Introduction
Methodology
OctupleMIDI Representation
Bar-view MidiVAE
Track-view MidiVAE
Multi-view MidiVAE
Experiments
Dataset
Experiment Setup
Comparison with Baseline
Ablation Study
The Effect of OctupleMIDI
Investigation of Multi-view
Conclusions

Figures (3)

Figure 1: The overall diagram of the proposed Multi-view MidiVAE. The model mainly contains Track- and Bar-view encoders, a multi-view information fusion (MIF), Track- and Bar-view decoders as well as an adaptive feature fusion (AFF).
Figure 2: The schematic diagram of OctupleMIDI. The "Dur" and "Inst" mean duration and instrument respectively.
Figure 3: (a) The details of Bar-view MidiVAE, where "PE" refers to the positional encoding. (b) The details of Track-view MidiVAE, where "Intra-inst" and "Inter-inst" respectively denote Intra-instrument and Inter-instrument.
