Composer Style-specific Symbolic Music Generation Using Vector Quantized Discrete Diffusion Models

Jincheng Zhang, György Fazekas, Charalampos Saitis

TL;DR

This work tackles controllable symbolic music generation by composer style, a problem made difficult by the discrete structure of symbolic music and its long-range dependencies. A VQ-VAE converts symbolic music into a sequence of discrete codebook indices, and a discrete diffusion model learns the distribution of those indices, conditioned on composer style via adaptive layer normalization (AdaLN). The approach achieves an average composer-style control accuracy of 72.36% (82.80% for Schubert) and superior OA-based distribution similarity, and it produces richer, more diverse outputs than the baselines. Modeling the discrete latent space with diffusion thus supports style-controlled symbolic music generation and points to extensions to more styles and to text-to-music tasks.
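To make the AdaLN conditioning concrete, here is a minimal sketch of adaptive layer normalization in plain Python. It assumes one common formulation, in which the normalized activations are rescaled by `1 + scale` and offset by `shift`; the exact parameterization used in the paper is not given in this summary. The `STYLE_MODULATION` table, the style names, and all numeric values are hypothetical stand-ins for a learned projection of the composer-style embedding.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and near-unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ada_layer_norm(x, scale, shift):
    """Layer-normalize x, then apply a condition-dependent scale and shift."""
    normed = layer_norm(x)
    return [(1.0 + s) * n + b for n, s, b in zip(normed, scale, shift)]

# Hypothetical lookup table: in a real model, scale and shift would come from
# a learned projection of the condition embedding, not from fixed constants.
STYLE_MODULATION = {
    "style_a": ([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
    "style_b": ([0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]),
}

hidden = [1.0, 2.0, 3.0, 4.0]  # toy hidden state of one token
out_a = ada_layer_norm(hidden, *STYLE_MODULATION["style_a"])
out_b = ada_layer_norm(hidden, *STYLE_MODULATION["style_b"])
```

With zero scale and shift (`style_a`) the layer reduces to ordinary layer normalization, while `style_b` moves the same hidden state to a different mean and spread. This is how a single denoising network can behave differently for each conditioning label.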

Abstract

Denoising Diffusion Probabilistic Models (DDPMs) have become increasingly utilised because of the promising results they have achieved in diverse generative tasks with continuous data, such as image and sound synthesis. Nonetheless, the success of diffusion models has not been fully extended to discrete symbolic music. We propose to combine a vector quantized variational autoencoder (VQ-VAE) and discrete diffusion models for the generation of symbolic music with desired composer styles. The trained VQ-VAE can represent symbolic music as a sequence of indexes that correspond to specific entries in a learned codebook. Subsequently, a discrete diffusion model is used to model the VQ-VAE's discrete latent space. The diffusion model is trained to generate intermediate music sequences consisting of codebook indexes, which are then decoded to symbolic music using the VQ-VAE's decoder. The evaluation results demonstrate that our model can generate symbolic music in the target composer style, matching the given condition with an accuracy of 72.36%. Our code is available at https://github.com/jinchengzhanggg/VQVAE-Diffusion.
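The codebook indexing described in the abstract can be illustrated with a small sketch of the standard VQ-VAE quantization step: each encoder output vector is replaced by the index of its nearest codebook entry, and the indexes are later mapped back to codebook vectors before decoding. The codebook, the latent vectors, and the function names below are toy assumptions for illustration; they are not taken from the authors' repository.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]

def lookup(indexes, codebook):
    """Map codebook indexes back to their vectors (the decoder's input)."""
    return [codebook[i] for i in indexes]

# Toy codebook with three 2-dimensional entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Toy encoder outputs for a three-step sequence (hypothetical values).
latents = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]

indexes = quantize(latents, codebook)    # [1, 0, 2]
quantized = lookup(indexes, codebook)    # vectors a decoder would receive
```

In the pipeline the abstract describes, the discrete diffusion model operates on sequences like `indexes`, and the VQ-VAE decoder turns the looked-up vectors back into symbolic music.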

Paper Structure (11 sections, 9 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Our approach uses a VQ-VAE to learn a codebook, whose composition is subsequently modeled with a discrete diffusion model. A Transformer is used as our denoising network.
  • Figure 2: Pianorolls generated by our vector quantized diffusion model (top) and samples from the training set (bottom).
  • Figure 3: Accuracy calculated by assessing whether the generated pieces' composers predicted by the classifier meet the conditions fed to our vector quantized diffusion model.
  • Figure 4: Distribution of Overall Preference ratings for our vector quantized diffusion model and other methods.
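The accuracy described in the Figure 3 caption can be read as the fraction of generated pieces whose classifier-predicted composer equals the condition given to the generator. The sketch below computes that fraction overall and per composer. The function names and the toy label lists are assumptions for illustration only and do not reproduce the paper's data or results.

```python
def control_accuracy(conditions, predictions):
    """Fraction of pieces whose predicted composer matches the condition."""
    if len(conditions) != len(predictions):
        raise ValueError("conditions and predictions must have equal length")
    if not conditions:
        raise ValueError("at least one generated piece is required")
    matches = sum(c == p for c, p in zip(conditions, predictions))
    return matches / len(conditions)

def per_composer_accuracy(conditions, predictions):
    """Control accuracy computed separately for each conditioning label."""
    totals, hits = {}, {}
    for cond, pred in zip(conditions, predictions):
        totals[cond] = totals.get(cond, 0) + 1
        hits[cond] = hits.get(cond, 0) + (cond == pred)
    return {label: hits[label] / totals[label] for label in totals}

# Toy example: four generated pieces and a classifier's guesses (made up).
conditions = ["Schubert", "Schubert", "other", "other"]
predictions = ["Schubert", "other", "other", "other"]

overall = control_accuracy(conditions, predictions)          # 0.75
by_composer = per_composer_accuracy(conditions, predictions)
```

An average of per-composer accuracies is not necessarily the same as the overall fraction when composers have different numbers of generated pieces, so the two functions are kept separate.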