Expressive Music Data Processing and Generation

Jingwei Liu

TL;DR

This work addresses preserving musical expressivity in AI-generated music by combining listening-based data processing with interdependent multi-argument modeling. Perceptual categorization based on Weber's law yields perceptually uniform time, duration, and velocity-change classes, and the multi-argument dependence is captured by a chain-rule factorization across five outputs implemented as five autoregressive LSTM submodels with attention. Additionally, the authors introduce an entropy-sequence criterion to screen generated sequences, linking stability and predictability to the notion of informational aesthetics via measures like mutual information $I(X,Y)=H(Y)-H(Y|X)$ and moving-average variance. Together, these mechanisms improve coherence and expressivity in symbolic piano generation and provide a framework for future reinforcement-learning extensions.
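The chain-rule factorization mentioned above can be written out as follows. This is a sketch under assumed notation, not the paper's own: $y_t^{(1)},\dots,y_t^{(5)}$ stand for the five output arguments at step $t$ in whatever order the submodels are chained, and $x_{<t}$ stands for the preceding note history. The last line restates the mutual-information identity quoted in the summary.

```latex
% Assumed notation: y_t^{(k)} is the k-th output argument at step t,
% x_{<t} is the note history before step t.
\begin{align}
p\bigl(y_t^{(1)},\dots,y_t^{(5)} \mid x_{<t}\bigr)
  &= \prod_{k=1}^{5} p\bigl(y_t^{(k)} \mid y_t^{(1)},\dots,y_t^{(k-1)},\, x_{<t}\bigr), \\
I(X,Y) &= H(Y) - H(Y \mid X).
\end{align}
```

Each factor in the product corresponds to one single-output submodel, which receives the outputs already sampled by the submodels before it as additional input.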

Abstract

Musical expressivity and coherence are indispensable in music composition and performance, yet they are often neglected in modern AI generative models. In this work, we introduce a listening-based data-processing technique that captures the expressivity of musical performance. Derived from Weber's law, the technique reflects how human listeners actually perceive change and preserves musical subtlety and expressivity in the training input. To facilitate musical coherence, we model the interdependencies among the multiple output arguments of the music data (pitch, duration, velocity, etc.) in the neural networks using the probabilistic chain rule. In practice, we decompose the multi-output sequential model into single-output submodels and condition each subsequent submodel on the previously sampled outputs to induce conditional distributions. Finally, to select eligible sequences from all generations, we propose a tentative measure based on the output entropy. The entropy sequence serves as a criterion for selecting predictable and stable generations, and it is further studied in the context of informational aesthetic measures to quantify musical pleasure and information gain along the musical tendency.
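To make the Weber's-law idea concrete, here is a minimal Python sketch of perceptually uniform binning, in which each class boundary is a fixed ratio above the previous one so that class width grows in proportion to the current value. The function names, the value range, and the Weber fraction of 0.25 are all hypothetical choices for illustration. The paper additionally adjusts its divisions using the data statistics, which this sketch does not attempt.

```python
import bisect


def weber_bin_edges(lo, hi, weber_fraction):
    """Return edges starting at lo, each (1 + weber_fraction) times the previous,
    until the last edge reaches or exceeds hi."""
    if lo <= 0 or hi <= lo or weber_fraction <= 0:
        raise ValueError("require 0 < lo < hi and weber_fraction > 0")
    edges = [lo]
    while edges[-1] < hi:
        edges.append(edges[-1] * (1.0 + weber_fraction))
    return edges


def classify(value, edges):
    """Map a value to the index of its class; out-of-range values are clamped
    to the first or last class."""
    idx = bisect.bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)


# Hypothetical example: durations between 0.01 s and 4 s, Weber fraction 0.25.
edges = weber_bin_edges(0.01, 4.0, 0.25)
short_class = classify(0.011, edges)
long_class = classify(2.0, edges)
```

Because the edges are geometrically spaced, a small absolute change between two short durations can cross a class boundary, while the same absolute change between two long durations usually does not.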

Paper Structure

This paper contains 9 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Categorical Distributions of Time Shift, Duration, and Velocity Change. The class divisions are determined jointly by Weber's law, under which perceptual changes are proportional to current values, and by the ground-truth statistics, so that the data distribution is balanced across the training classes.
  • Figure 2: A Way to Capture Interdependency in a Multi-argument Sequential Model. The interdependencies are modeled with probabilistic conditioning and the multi-argument output model is decomposed into separate sequential submodels with a single output.
  • Figure 3: Attention Score for Five Sequential Models. The weights are computed from the neural network parameters related to each input field of the models.
  • Figure 4: Statistics of Entropy Sequence. This figure presents the mean, variance, and moving-average variance distributions of the data and generation entropy sequences from the five sequential models in Figure \ref{fig: LSTM5}.
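The entropy-sequence statistics named in the Figure 4 caption can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: each step's model output is taken to be a probability vector, entropy is measured in bits, and "moving-average variance" is interpreted here as the variance of the moving-averaged entropy sequence. The function names and the window size are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of one output distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Population variance."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def moving_average(xs, window):
    """Simple moving average over full windows only."""
    return [mean(xs[i:i + window]) for i in range(len(xs) - window + 1)]


def entropy_stats(distributions, window=4):
    """Summarize a sequence of per-step output distributions."""
    h = [entropy(p) for p in distributions]
    return {
        "mean": mean(h),
        "variance": variance(h),
        # Assumed reading: variance of the smoothed entropy sequence.
        "ma_variance": variance(moving_average(h, window)),
    }


# Hypothetical toy input: the model alternates between uncertain and certain steps.
uniform4 = [0.25, 0.25, 0.25, 0.25]
one_hot = [1.0, 0.0, 0.0, 0.0]
stats = entropy_stats([uniform4, one_hot] * 4, window=2)
```

A screening rule could then keep only generations whose statistics fall within ranges observed on the training data; the exact thresholds are not given in this summary.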