Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation

Jingyue Huang, Ke Chen, Yi-Hsuan Yang

TL;DR

This work tackles emotion-controlled piano music generation by separating valence and arousal into a two-stage pipeline: a lead-sheet stage models valence, while a performance stage controls arousal. It introduces a functional representation that encodes melody and chords relative to the key via Roman numerals, capturing interactions among notes, chords, and tonality. The model is pretrained on large unlabeled datasets and then fine-tuned on the emotion-labeled EMOPIA dataset, with key data curation to ensure quality. Experiments show improvements in objective key consistency and in subjective judgments of valence, arousal, and four-quadrant (4Q) emotion classification, demonstrating enhanced controllability and potential for applications in music therapy, scoring, and media synchronization.

Abstract

Managing the emotional aspect remains a challenge in automatic music generation. Prior works aim to learn various emotions at once, leading to inadequate modeling. This paper explores the disentanglement of emotions in piano performance generation through a two-stage framework. The first stage focuses on valence modeling of lead sheets, and the second stage addresses arousal modeling by introducing performance-level attributes. To further capture features that shape valence, an aspect less explored by previous approaches, we introduce a novel functional representation of symbolic music. This representation aims to capture the emotional impact of major-minor tonality, as well as the interactions among notes, chords, and key signatures. Objective and subjective experiments validate the effectiveness of our framework in both emotional valence and arousal modeling. We further leverage our framework in a novel application of emotional controls, showing a broad potential in emotion-driven music generation.
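
To make the idea of a key-relative ("functional") encoding concrete, the following is a minimal sketch, not the paper's tokenizer. It maps note letters to Roman-numeral scale degrees relative to a given tonic and mode. The names `PITCH_CLASSES`, `MAJOR_DEGREES`, `NATURAL_MINOR_DEGREES`, and `to_degree` are hypothetical, and labeling every chromatic note as a raised version of the degree below it is an assumption made here for simplicity; the paper's own conversion rules may differ.

```python
# Illustrative sketch only: all names below are hypothetical and are
# not taken from the paper or its code.

# Pitch class (0-11) for common note spellings.
PITCH_CLASSES = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10,
    "Bb": 10, "B": 11,
}

# Semitone offset from the tonic -> Roman-numeral scale degree.
MAJOR_DEGREES = {0: "I", 2: "II", 4: "III", 5: "IV", 7: "V", 9: "VI", 11: "VII"}
NATURAL_MINOR_DEGREES = {0: "I", 2: "II", 3: "III", 5: "IV", 7: "V", 8: "VI", 10: "VII"}


def to_degree(note: str, tonic: str, mode: str = "major") -> str:
    """Return the scale degree of `note` relative to `tonic` in `mode`."""
    offset = (PITCH_CLASSES[note] - PITCH_CLASSES[tonic]) % 12
    table = MAJOR_DEGREES if mode == "major" else NATURAL_MINOR_DEGREES
    if offset in table:
        return table[offset]
    # Assumption: a non-scale note is written as the raised degree below it.
    # In both tables every non-scale offset has a scale degree one semitone
    # below, so this lookup cannot fail.
    return "#" + table[offset - 1]


# Two melodies with different absolute pitches, in different keys.
in_d_major = [to_degree(n, "D", "major") for n in ["F#", "A", "B"]]
in_c_minor = [to_degree(n, "C", "minor") for n in ["Eb", "G", "Ab"]]
print(in_d_major)  # ['III', 'V', 'VI']
print(in_c_minor)  # ['III', 'V', 'VI']
```

Under these assumptions, two melodies that differ in absolute pitch reduce to the same degree sequence once the key is factored out, which is the kind of key-independent regularity a functional representation is meant to expose.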

Paper Structure (26 sections, 2 equations, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Key histogram of high/low valence clips from the emotion-labeled piano music dataset EMOPIA.
  • Figure 2: Illustration of (a) REMI, (b) the proposed functional representation, and their differences.
  • Figure 3: The conversion between letters and Roman numerals in the cases of C major and c minor scales. Solid arrows denote strict one-to-one conversions, and dotted arrows denote optional one-to-either conversions.
  • Figure 4: Two lead sheet examples from different songs in EMOPIA. In our functional representation, they share the same melody events (green) but have different chord events (yellow), owing to their different emotions (Positive and Negative, shown in pink) and keys (D major and c minor, shown in purple).
  • Figure 5: The two-stage framework of emotion-driven piano performance generation. Squares with transparent background denote the tokens that are not included in the loss computation during the training phase.
  • ...and 3 more figures