Learning Interpretable Representation for Controllable Polyphonic Music Generation
Ziyu Wang, Dingsu Wang, Yixiao Zhang, Gus Xia
TL;DR
This work addresses the challenge of controlling polyphonic music generation by learning two interpretable latent factors, $z_{\text{chd}}$ (chord, the content) and $z_{\text{txt}}$ (texture, the style), within a variational autoencoder framework. It introduces a chord encoder/decoder and a texture encoder, coupled with a PianoTree-style hierarchical decoder, to enable controllable tasks such as compositional style transfer, texture sampling, and accompaniment arrangement. Objective and subjective evaluations demonstrate successful disentanglement and high-quality, controllable generation, with some texture-transferred pieces rated even higher than human originals. Because the two factors are interpretable, the model can serve as a practical co-creative interface for algorithmic composition, and it could potentially be extended to longer-range and multi-factor control.
Abstract
While deep generative models have become the leading methods for algorithmic composition, controlling the generation process remains a challenging problem because the latent variables of most deep-learning models lack good interpretability. Inspired by the idea of content-style disentanglement, we design a novel architecture, under the VAE framework, that effectively learns two interpretable latent factors of polyphonic music: chord and texture. The current model focuses on learning 8-beat-long piano composition segments. We show that such chord-texture disentanglement provides a controllable generation pathway leading to a wide spectrum of applications, including compositional style transfer, texture variation, and accompaniment arrangement. Both objective and subjective evaluations show that our method achieves successful disentanglement and high-quality controlled music generation.
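
As a rough illustration of how a chord-texture split could enable compositional style transfer, the sketch below recombines the chord factor of one segment with the texture factor of another. It is a toy under stated assumptions, not the authors' implementation: the `Piece` format, the copy-through `toy_chord_encoder` and `toy_texture_encoder`, and the choice to concatenate the two factors in `decoder_input` are all hypothetical stand-ins for the learned networks.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]
# Hypothetical toy format for an 8-beat segment; not the paper's data representation.
Piece = Dict[str, Vector]


@dataclass
class Latents:
    z_chd: Vector  # chord (content) factor
    z_txt: Vector  # texture (style) factor


def toy_chord_encoder(piece: Piece) -> Vector:
    # Stand-in for a learned chord encoder: copies the chord features.
    return list(piece["chords"])


def toy_texture_encoder(piece: Piece) -> Vector:
    # Stand-in for a learned texture encoder: copies the texture features.
    return list(piece["texture"])


def encode(piece: Piece) -> Latents:
    # Encode one segment into its two factors.
    return Latents(toy_chord_encoder(piece), toy_texture_encoder(piece))


def transfer_texture(content: Latents, style: Latents) -> Latents:
    # Keep the chords of `content` and borrow the texture of `style`.
    return Latents(z_chd=list(content.z_chd), z_txt=list(style.z_txt))


def decoder_input(latents: Latents) -> Vector:
    # Assumption: the decoder consumes the two factors concatenated.
    return latents.z_chd + latents.z_txt


piece_a = {"chords": [0.0, 4.0, 7.0], "texture": [1.0, 0.0]}
piece_b = {"chords": [2.0, 5.0, 9.0], "texture": [0.5, 0.5]}

# Chords from piece_a, texture from piece_b.
mixed = transfer_texture(encode(piece_a), encode(piece_b))
```

In a real model the vector returned by `decoder_input(mixed)` would be passed to the trained decoder to produce notes. Texture variation would follow the same pattern, with `z_txt` drawn from the prior instead of encoded from a second piece.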
