Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation
Zhiwei Lin, Jun Chen, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu, Helen Meng
TL;DR
The paper addresses the problem of modeling and generating long multi-track symbolic music with VAEs, where prior approaches struggle with long sequences and coherence. It introduces Multi-view MidiVAE, which combines a 2-D OctupleMIDI representation with dual transformer-based views (Track-view and Bar-view) and an adaptive fusion mechanism to learn a shared latent space for global harmony and local detail. The objective combines reconstruction losses for the full sequence and its projections plus a KL regularization, expressed as $L_{\text{total}} = L_{\text{rs}} + L_{\text{rst}} + L_{\text{rsb}} + \beta L_{\text{KL}}$. On CocoChorales, the approach yields significant improvements in reconstruction accuracy and listener MOS compared with baselines such as REMI+ and single-view variants, demonstrating enhanced long-range modeling for multi-instrument symbolic music. Overall, the method provides a scalable, effective VAE framework for coherent long-range symbolic music generation and editing.
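As a minimal sketch of how the objective above composes, the snippet below sums the three reconstruction terms and adds a $\beta$-weighted KL term. The function names `gaussian_kl` and `total_loss` are hypothetical, and the closed-form KL assumes a diagonal Gaussian posterior with a standard normal prior, which is the common VAE choice but is not confirmed by the text above.

```python
import math
from typing import Sequence


def gaussian_kl(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def total_loss(l_rs: float, l_rst: float, l_rsb: float, l_kl: float,
               beta: float = 1.0) -> float:
    """L_total = L_rs + L_rst + L_rsb + beta * L_KL.

    The three reconstruction terms are passed in as already-computed scalars;
    how each one is computed is outside the scope of this sketch.
    """
    return l_rs + l_rst + l_rsb + beta * l_kl


# Example with made-up scalar values, for illustration only.
kl = gaussian_kl(mu=[0.0, 0.0], log_var=[0.0, 0.0])  # posterior equals prior
loss = total_loss(l_rs=1.0, l_rst=0.5, l_rsb=0.5, l_kl=kl, beta=0.1)
```

When the posterior matches the prior, the KL term vanishes and the total reduces to the sum of the reconstruction terms; a smaller $\beta$ weakens the regularization relative to reconstruction.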
Abstract
Variational Autoencoders (VAEs) constitute a crucial component of neural symbolic music generation, and several VAE-based works have yielded outstanding results and attracted considerable attention. Nevertheless, previous VAEs still suffer from overly long feature sequences, and their generated results lack contextual coherence; the challenge of modeling long multi-track symbolic music therefore remains unaddressed. To this end, we propose Multi-view MidiVAE, one of the first VAE methods to effectively model and generate long multi-track symbolic music. Multi-view MidiVAE uses the two-dimensional (2-D) representation OctupleMIDI to capture relationships among notes while reducing the feature sequence length. Moreover, we attend to instrumental characteristics and harmony, as well as global and local information about the musical composition, by employing a hybrid variational encoding-decoding strategy that integrates both Track- and Bar-view MidiVAE features. Objective and subjective experimental results on the CocoChorales dataset demonstrate that, compared to the baseline, Multi-view MidiVAE achieves significant improvements in modeling long multi-track symbolic music.
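To illustrate the two views described above, the following sketch stores each note as a single 8-field tuple and groups the same note list once by instrument (a track-style view) and once by bar (a bar-style view). The `Note` type, its field names, the `group_by` helper, and the toy notes are all hypothetical; the field set assumes the eight attributes commonly associated with OctupleMIDI, and the grouping is only a rough stand-in for how the Track- and Bar-view encoders might partition their inputs.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Note(NamedTuple):
    # Assumed field set, modeled on the eight OctupleMIDI attributes.
    time_sig: str
    tempo: int
    bar: int
    position: int
    instrument: int
    pitch: int
    duration: int
    velocity: int


def group_by(notes: List[Note], key: str) -> Dict[int, List[Note]]:
    """Partition notes by one attribute, preserving their original order."""
    groups: Dict[int, List[Note]] = defaultdict(list)
    for note in notes:
        groups[getattr(note, key)].append(note)
    return dict(groups)


# Toy two-instrument, two-bar excerpt (made-up values).
notes = [
    Note("4/4", 120, 0, 0, 0, 60, 4, 80),
    Note("4/4", 120, 0, 0, 1, 48, 8, 70),
    Note("4/4", 120, 1, 0, 0, 62, 4, 80),
    Note("4/4", 120, 1, 4, 1, 50, 4, 70),
]

track_view = group_by(notes, "instrument")  # one sub-sequence per instrument
bar_view = group_by(notes, "bar")           # one sub-sequence per bar
```

Because every note occupies one sequence position regardless of how many attributes it carries, the sequence length here equals the number of notes, whereas a 1-D event stream would spend several tokens per note.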
