Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model

Shipei Liu, Xiaoya Fan, Guowei Wu

TL;DR

Addresses the limitation of language-based symbolic music generation in capturing frequency continuity and rare notes by modeling a joint distribution over notes $n$, chords $c$, and sections $s$ with diffusion. Proposes Music-Diff, a joint probabilistic diffusion framework with Forward and Reverse processes, Joint Semantic Pre-training (JSP), and a multi-branch denoiser (Symb-RWKV). Demonstrates that joint perturbation and conditional denoising produce higher sample diversity and improved long-range structure compared with language models and existing diffusion methods, across pitch, rhythm, and structure metrics. Case studies validate rhythmic advantages and coherent hierarchical organization, suggesting practical potential for scalable symbolic music generation and polyphonic collaboration.

Abstract

Existing music generation models are mostly language-based and neglect the frequency continuity of notes, which leads to inadequate fitting of rare or never-used notes and reduces the diversity of generated samples. We argue that the distribution of notes can be modeled through translational invariance and periodicity, in particular by using diffusion models that generalize over notes by injecting frequency-domain Gaussian noise. However, due to the low-density nature of music symbols, estimating the distribution of notes latent in the high-density solution space poses significant challenges. To address this problem, we introduce the Music-Diff architecture, which fits a joint distribution of notes and accompanying semantic information to generate symbolic music conditionally. We first enhance the fragmentation module for extracting semantics by using event-based notation and the structural similarity index, thereby preventing boundary blurring. As a prerequisite for multivariate perturbation, we introduce a joint pre-training method that constructs the progressions between notes and musical semantics while avoiding direct modeling of low-density notes. Finally, we recover the perturbed notes with a multi-branch denoiser that fits multiple noise objectives via Pareto optimization. Our experiments suggest that, in contrast to language models, joint probabilistic diffusion models that perturb at both the note and semantic levels can provide more sample diversity and compositional regularity. The case study highlights the rhythmic advantages of our model over language-based and DDPM-based models by analyzing the hierarchical structure expressed in the self-similarity metrics.
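
To make the idea of perturbing notes with Gaussian noise concrete, the following is a minimal sketch of the standard DDPM forward process applied to a toy pitch sequence. It is not the Music-Diff implementation: the linear noise schedule, the use of normalized MIDI pitch numbers as a continuous stand-in for frequency, and all names (`linear_beta_schedule`, `forward_perturb`, the example melody) are assumptions made for illustration only. The joint perturbation of chords and sections described in the abstract is not modeled here.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule, a common DDPM default (assumed, not from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_perturb(x0, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one step."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

# Hypothetical toy melody as MIDI pitch numbers, scaled to roughly [-1, 1].
pitches = np.array([60, 62, 64, 65, 67, 65, 64, 62], dtype=float)
x0 = (pitches - 64.0) / 64.0

# Perturb at an intermediate step; eps is the target a denoiser would predict.
t = 100
x_t, eps = forward_perturb(x0, t, alpha_bar, rng)

# Mapping back to the pitch grid can land on pitches absent from the input,
# which is the intuition behind covering rare or never-used notes.
noisy_pitches = np.rint(x_t * 64.0 + 64.0).astype(int)
print(noisy_pitches)
```

Given `x_t` and the true noise `eps`, the clean sequence is recovered exactly by inverting the same formula. This is why training a network to predict `eps` is sufficient for the reverse process in a standard DDPM.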

Paper Structure (16 sections, 6 equations, 8 figures, 4 tables, 1 algorithm)

Figures (8)

  • Figure 1: Illustration of our Music-Diff architecture, consisting of (a) fragmentation, (b) forward, and (c) reverse processes for semantic extraction, noise perturbation, and note recovery, respectively.
  • Figure 2: Specification of JSP method representing the progression of note-chord and chord-section.
  • Figure 3: Illustration of backbone networks in the denoising.
  • Figure 4: Example of fragmentation difference between FSL-v1 (upper) and FSL-v2 (lower).
  • Figure 5: Comparison of samples generated using different backbones.
  • ...and 3 more figures