Learning Interpretable Representation for Controllable Polyphonic Music Generation

Ziyu Wang, Dingsu Wang, Yixiao Zhang, Gus Xia

TL;DR

This work addresses the challenge of controlling polyphonic music generation by learning two interpretable latent factors, $z_{\text{chd}}$ (chord, i.e. content) and $z_{\text{txt}}$ (texture), within a variational autoencoder framework. It introduces a chord encoder/decoder and a texture encoder, coupled with a PianoTree-style hierarchical decoder, to enable controllable tasks such as compositional style transfer, texture sampling, and accompaniment arrangement. Objective and subjective evaluations demonstrate successful disentanglement and high-quality, controllable generation, with some texture-transferred pieces rated even higher than human originals. The approach improves interpretability and offers a practical co-creative interface for algorithmic composition, with potential for extension to longer-range and multi-factor control.

Abstract

While deep generative models have become the leading methods for algorithmic composition, it remains a challenging problem to control the generation process because the latent variables of most deep-learning models lack good interpretability. Inspired by the content-style disentanglement idea, we design a novel architecture, under the VAE framework, that effectively learns two interpretable latent factors of polyphonic music: chord and texture. The current model focuses on learning 8-beat-long piano composition segments. We show that such chord-texture disentanglement provides a controllable generation pathway leading to a wide spectrum of applications, including compositional style transfer, texture variation, and accompaniment arrangement. Both objective and subjective evaluations show that our method achieves successful disentanglement and high-quality controlled music generation.
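
To make the idea of compositional style transfer concrete, the following is a minimal sketch of the latent swap that chord-texture disentanglement allows: keep the chord factor of one segment and the texture factor of another. It is not the paper's implementation. The real encoders and decoder are neural networks, whereas here the latents are plain lists of floats, and the names `Latent` and `style_transfer` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Latent:
    """Hypothetical container for the two latent factors of one segment."""
    z_chd: List[float]  # chord factor (content)
    z_txt: List[float]  # texture factor


def style_transfer(chord_source: Latent, texture_source: Latent) -> Latent:
    """Combine the chord factor of one segment with the texture of another.

    In the real model the result would be passed to the decoder;
    this sketch stops at the recombined latent code.
    """
    return Latent(z_chd=chord_source.z_chd, z_txt=texture_source.z_txt)


# Toy latents standing in for encoder outputs (values are made up).
piece_a = Latent(z_chd=[0.1, 0.2], z_txt=[1.0, 1.5])
piece_b = Latent(z_chd=[0.7, 0.9], z_txt=[-0.3, 0.4])

# Chords of piece A rendered with the texture of piece B.
mixed = style_transfer(piece_a, piece_b)
```

Because each factor is interpretable on its own, the same recombination pattern would cover the other applications the abstract lists: sampling a new texture factor while holding the chord factor fixed gives texture variation.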

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: The model diagram.
  • Figure 2: An example of compositional style transfer of 16-bar-long samples when $k = 2$.
  • Figure 3: Examples of texture variations via posterior sampling and prior sampling.
  • Figure 4: An example of accompaniment arrangement conditioned on melody, chord progression, and first 2 bars of accompaniment.
  • Figure 5: Results of objective measurement.
  • ...and 2 more figures