MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion

Shulei Ji; Xinyu Yang

TL;DR

MusER addresses the challenge of generating symbolic music with controllable emotion by disentangling discrete musical elements in the latent space. It introduces a musical element disentanglement module (MED) with a latent-regularization objective and a two-level decoding strategy (TD) built on a VQ-VAE backbone, enabling element-level control and emotion transfer. Visualization and experiments show a disentangled latent space and superior performance over prior models in both objective metrics and subjective listening tests, with successful element-transfer demonstrations that can alter arousal ($A$) and valence ($V$). The approach offers a practical path toward fine-grained, emotion-aware music generation and manipulation, with potential extensions to other discrete sequential domains.

Abstract

Generating music with emotion is an important task in automatic music generation, in which emotion is evoked through a variety of musical elements (such as pitch and duration) that change over time and collaborate with each other. However, prior research on deep learning-based emotional music generation has rarely explored the contribution of different musical elements to emotions, let alone the deliberate manipulation of these elements to alter the emotion of music, which is not conducive to fine-grained element-level control over emotions. To address this gap, we present a novel approach employing musical element-based regularization in the latent space to disentangle distinct elements, investigate their roles in distinguishing emotions, and further manipulate elements to alter musical emotions. Specifically, we propose a novel VQ-VAE-based model named MusER. MusER incorporates a regularization loss to enforce the correspondence between the musical element sequences and the specific dimensions of latent variable sequences, providing a new solution for disentangling discrete sequences. Taking advantage of the disentangled latent vectors, a two-level decoding strategy that includes multiple decoders attending to latent vectors with different semantics is devised to better predict the elements. By visualizing the latent space, we conclude that MusER yields a disentangled and interpretable latent space, and we gain insights into the contribution of distinct elements to the emotional dimensions (i.e., arousal and valence). Experimental results demonstrate that MusER outperforms state-of-the-art models for generating emotional music in both objective and subjective evaluations. In addition, we rearrange music through element transfer and attempt to alter the emotion of music by transferring emotion-distinguishable elements.
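Because MusER is built on a VQ-VAE backbone, it may help to see the quantization step in isolation. The following is a minimal sketch of generic VQ-VAE nearest-neighbour quantization only, not MusER's implementation: the function names, the toy codebook, and the latent values are all hypothetical, and the element-based regularization and two-level decoding are not modelled here.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Replace each latent vector with its nearest codebook entry.

    Returns the chosen code indices and the quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Pick the codebook row with the smallest distance to this latent.
        idx = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy data (made up for illustration): 3 codebook entries, 2 latent time steps.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.9, 1.2], [-0.2, 0.1]]
indices, quantized = quantize(latents, codebook)
print(indices)  # code index chosen at each time step
```

The lookup itself is not differentiable, which is why VQ-VAE training copies the decoder-side gradient straight to the encoder output; an autodiff framework would be needed to demonstrate that part.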

Paper Structure (35 sections, 8 equations, 12 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Emotion-Conditioned Music Generation
Interpretable Latent Representation Learning
Methodology
Background
Music Representation.
VQ-VAE.
Musical Element-Based Regularization
Two-level Decoding
Training and Inference
Training Objective.
Inference.
Latent Space Visualization
Experiments
...and 20 more sections

Figures (12)

Figure 1: Schematic diagrams for understanding emotions.
Figure 1: t-SNE visualization of element-specific latent space.
Figure 2: The architecture of MusER, consisting of four components: music representation and encoding, vector quantization, musical element disentanglement (MED), and two-level decoding (TD). DR is the acronym for dimensionality reduction. $\mathrm{Emb(o)}$ denotes the emotion embedding. The gradient $\nabla_{\boldsymbol{z}}L$ (in red) is passed unaltered to the encoder during the backward pass. The dashed box indicates that a conditional autoregressive model is trained to predict discrete codes during inference.
Figure 2: Musical element distributions of models with different configurations. For comparison, we provide again the element distributions of real music, i.e., EMOPIA.
Figure 3: The schematic illustration of the regularization for element $\epsilon$. DR represents dimensionality reduction. Dec denotes the decoding module. "subtract" means the subtraction of entries at the same position in two sequences, and $\varDelta$ denotes the result of the subtraction. "close" indicates that two vectors or matrices are expected to be as close as possible.
...and 7 more figures
